Azure for Beginners - Creating an Azure Data Factory

If you have been following along with our Azure for Beginners series, you have already been introduced to Azure Storage Account and Azure SQL Database. Now we will introduce another service: Azure Data Factory. Like those two services, Azure Data Factory is a cloud-based data service provided by Microsoft Azure. Where it differs is that it helps you transport and transform your data, and automate and schedule that work. For example, with Azure Data Factory we can take the data we have uploaded to our Azure Storage Account, transform it, and move it into our Azure SQL Database. In this part of the Azure for Beginners guide, we will walk you through the steps of creating an Azure Data Factory.

Note: The look of Azure tends to change regularly. Your screen might not look exactly like the images in this guide. Some features might have been moved to a different tab, but the general steps should be the same.

Note: The services in Azure are all paid services. The cost depends on your usage and configuration. (More information can be found here.) We advise doing this under the supervision of a database administrator or a data engineer.

Azure Data Factory

After logging in, navigate to Azure Data Factory. You can do so in various ways. One of the easiest ways would be to click on the 'Create a resource' button in the top left corner of the start screen.

Next, click on Analytics in the sidebar to the left and on Create below Data Factory.

Similar to using other features in the Azure ecosystem, creating an Azure Data Factory will lead us through several tabs. We will start with the Basics tab.

Select your subscription and your resource group. If you don't have a resource group yet, create one through the 'Create new' link underneath it.

We then have to name our new Data Factory instance. As with an Azure Storage Account and an Azure SQL Database, we also have to choose a region. Finally, we have to choose a Data Factory version. At the moment of writing, only two versions have been released, with V2 being an enhanced version of V1. If you have a choice, we'd recommend choosing the latest version.
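To make the Basics tab choices concrete, here is a minimal Python sketch of how the name, region, and version could look if written down as an ARM-template-style resource entry. The factory name 'adf-beginners-demo' and the region 'westeurope' are made-up examples, and the naming check reflects the rule as we understand it (3 to 63 characters, letters, digits, and single hyphens, starting and ending with a letter or digit), so treat it as an approximation rather than the portal's exact validation.

```python
import json
import re

# Naming rule as we understand it: letters, digits and single hyphens,
# starting and ending with a letter or digit, 3 to 63 characters long.
_NAME_PATTERN = re.compile(r"[A-Za-z0-9]+(-[A-Za-z0-9]+)*")


def is_valid_factory_name(name: str) -> bool:
    """Return True if the name satisfies the naming rule sketched above."""
    return 3 <= len(name) <= 63 and _NAME_PATTERN.fullmatch(name) is not None


def build_factory_resource(name: str, region: str) -> dict:
    """Build an ARM-template-style resource entry for a V2 Data Factory."""
    if not is_valid_factory_name(name):
        raise ValueError(f"invalid Data Factory name: {name!r}")
    return {
        "type": "Microsoft.DataFactory/factories",
        "apiVersion": "2018-06-01",  # API version associated with Data Factory V2
        "name": name,
        "location": region,
        "identity": {"type": "SystemAssigned"},
        "properties": {},
    }


# Hypothetical name and region, used only for illustration.
resource = build_factory_resource("adf-beginners-demo", "westeurope")
print(json.dumps(resource, indent=2))
```

Checking the name before anything else mirrors what the portal does on this tab: an invalid name is flagged right away, before you can continue to the next tab.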

Git Configuration Tab

Next, we will move on to the Git configuration tab. Git is a version control system that tracks changes over time. When you set up Git integration, you connect your Azure Data Factory to a Git repository, which is like a folder in the cloud where you can store project files (describing the structure and organization of your Data Factory), as well as the entire revision history. When you start to work in Azure Data Factory, the changes you make will automatically be saved (as JSON files) in the connected Git repository, which lets you do version control, collaborate with others more easily, and revert to a previous state if needed. You can also create separate branches in Git to work on different parts of your project without affecting the main (production) branch.

By deselecting 'Configure Git later', you will have the option to select and configure a repository (either Azure DevOps or GitHub).

For this beginner's guide, we will leave 'Configure Git later' selected, which means that we will not set up a Git repository, and move on to the Networking tab.
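For readers curious what the Git settings amount to, the following Python sketch builds the kind of 'repoConfiguration' fragment that a Data Factory resource definition can carry. The account name 'example-org' and repository name 'adf-beginners-repo' are hypothetical, the default branch 'main' and root folder '/' are our own assumptions, and the helper functions are illustrative only. Leaving 'Configure Git later' selected corresponds to having no such fragment at all.

```python
import json
from typing import Optional

# Type names for the two repository providers offered on the tab.
_PROVIDER_TYPES = {
    "github": "FactoryGitHubConfiguration",
    "azure_devops": "FactoryVSTSConfiguration",
}


def build_repo_configuration(
    provider: str,
    account_name: str,
    repository_name: str,
    collaboration_branch: str = "main",  # assumed default branch name
    root_folder: str = "/",              # assumed folder for the JSON files
    project_name: Optional[str] = None,
) -> dict:
    """Return a repoConfiguration fragment for the chosen Git provider."""
    if provider not in _PROVIDER_TYPES:
        raise ValueError(f"unknown provider: {provider!r}")
    config = {
        "type": _PROVIDER_TYPES[provider],
        "accountName": account_name,
        "repositoryName": repository_name,
        "collaborationBranch": collaboration_branch,
        "rootFolder": root_folder,
    }
    if provider == "azure_devops":
        # Azure DevOps repositories live inside a project.
        if not project_name:
            raise ValueError("Azure DevOps repositories need a project name")
        config["projectName"] = project_name
    return config


def factory_properties(repo_configuration: Optional[dict] = None) -> dict:
    """'Configure Git later' corresponds to leaving repoConfiguration out."""
    if repo_configuration is None:
        return {}
    return {"repoConfiguration": repo_configuration}


git_later = factory_properties()
with_github = factory_properties(
    build_repo_configuration("github", "example-org", "adf-beginners-repo")
)
print(json.dumps(git_later))
print(json.dumps(with_github, indent=2))
```

The collaboration branch plays the role of the main branch mentioned above: separate feature branches can be worked on without touching it.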

Networking Tab

We move on to the Networking tab where we can define network access to the Data Factory.

You will have the option to enable Managed Virtual Network on the default AutoResolveIntegrationRuntime. Enabling this feature creates a private, isolated network space managed by Azure Data Factory, so all data movements and transformations performed by Azure Data Factory are done within that network environment. This is basically an extra security layer. For our guide, we will leave it unchecked.

Next, we will have to choose how to let Azure Data Factory access our data sources. This can be done through either a public or a private endpoint. The former will use the public internet to connect to your data sources, while the latter will use a private IP address within your managed virtual network, which makes it more secure as your data doesn't travel over the public internet. For our beginner's guide, we will choose 'Public endpoint'.
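The two Networking choices can be pictured as settings on the Data Factory definition. The Python sketch below rests on assumptions: we assume the endpoint choice maps onto a 'publicNetworkAccess' property with the values 'Enabled' and 'Disabled', and that a managed virtual network appears as a child resource named 'default'. The 'childResources' key and the function itself are our own illustrative wrapper, not part of any Azure API.

```python
import json


def build_network_settings(factory_name: str, endpoint: str, managed_vnet: bool) -> dict:
    """Translate the two Networking tab choices into template-style fragments."""
    if endpoint not in ("public", "private"):
        raise ValueError(f"endpoint must be 'public' or 'private', got {endpoint!r}")
    settings = {
        "properties": {
            # Assumed mapping: a public endpoint keeps public network access on.
            "publicNetworkAccess": "Enabled" if endpoint == "public" else "Disabled",
        },
        # Hypothetical container for resources nested under the factory.
        "childResources": [],
    }
    if managed_vnet:
        # Assumed shape of the managed virtual network child resource.
        settings["childResources"].append({
            "type": "Microsoft.DataFactory/factories/managedVirtualNetworks",
            "apiVersion": "2018-06-01",
            "name": f"{factory_name}/default",
            "properties": {},
        })
    return settings


# The choices made in this guide: public endpoint, no managed virtual network.
guide_choice = build_network_settings("adf-beginners-demo", "public", managed_vnet=False)
print(json.dumps(guide_choice, indent=2))
```

With the guide's choices, the result contains public network access set to 'Enabled' and no child resources, which is the simplest configuration to start with.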

Advanced Settings Tab

The Advanced tab lets us enable encryption using a Customer Managed Key. By default, your data is encrypted with Microsoft-managed keys. Here you will have the option to use your own keys, which will need to be stored in a key vault in the same region. For our beginner's guide, we will leave this option unchecked and move on to the Tags tab.
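As a rough illustration of the customer-managed key option, the Python sketch below builds an 'encryption' fragment pointing at a key in a key vault and enforces the same-region requirement described above. The vault URL 'https://example-vault.vault.azure.net' and the key name 'adf-demo-key' are hypothetical, and the region check is a simple string comparison we added for illustration. Leaving the option unchecked corresponds to having no 'encryption' fragment, in which case Microsoft-managed keys are used.

```python
import json
from typing import Optional


def build_encryption_settings(
    factory_region: str,
    vault_region: str,
    vault_base_url: str,
    key_name: str,
    key_version: Optional[str] = None,
) -> dict:
    """Return an encryption fragment for a customer-managed key."""
    # The key vault has to be in the same region as the Data Factory.
    if factory_region.lower() != vault_region.lower():
        raise ValueError("the key vault must be in the same region as the Data Factory")
    encryption = {"vaultBaseUrl": vault_base_url, "keyName": key_name}
    if key_version is not None:
        # Pinning a version is optional in this sketch.
        encryption["keyVersion"] = key_version
    return {"encryption": encryption}


# Hypothetical vault and key, used only for illustration.
cmk = build_encryption_settings(
    factory_region="westeurope",
    vault_region="westeurope",
    vault_base_url="https://example-vault.vault.azure.net",
    key_name="adf-demo-key",
)
print(json.dumps(cmk, indent=2))
```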

Tags Tab

As we saw when creating the Storage Account and the SQL Database, the Tags tab lets us set tags to categorize our resources. Tags are applied to our resource, in this case our Data Factory. If we are running multiple Data Factories, these tags let us quickly find all resources with specific tags. Tags are stored as key-value pairs. For example, if we enter 'Department' in the Name field and 'Finance' in the Value field, they form a key-value pair. Another example would be 'Project' as the Name and 'YourProjectName123' as the Value. If we wanted to track costs for this project for the Finance department, we could then filter on 'Department = Finance' and 'Project = YourProjectName123'. For our beginner's guide, we will skip the tags and move on to the Review + create tab.
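The filtering idea can be shown with a few lines of plain Python. The resource names below are invented for illustration, and the filter is a local stand-in for what the Azure portal does when you filter on tags; it does not call Azure.

```python
from typing import Dict, List

# Hypothetical resources, each with its own set of tags (key-value pairs).
resources = [
    {"name": "adf-finance-etl",
     "tags": {"Department": "Finance", "Project": "YourProjectName123"}},
    {"name": "adf-finance-reporting",
     "tags": {"Department": "Finance", "Project": "OtherProject"}},
    {"name": "adf-marketing-etl",
     "tags": {"Department": "Marketing", "Project": "YourProjectName123"}},
    {"name": "adf-untagged", "tags": {}},
]


def filter_by_tags(items: List[dict], required: Dict[str, str]) -> List[dict]:
    """Keep only the items whose tags contain every required key-value pair."""
    return [
        item for item in items
        if all(item.get("tags", {}).get(key) == value
               for key, value in required.items())
    ]


matches = filter_by_tags(
    resources, {"Department": "Finance", "Project": "YourProjectName123"}
)
print([item["name"] for item in matches])
```

Only the resource carrying both tags is returned, which is why combining tags such as 'Department' and 'Project' is useful for narrowing down costs.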

Review + Create Tab

Finally, we have arrived at the Review + create tab, which is an overview of all the choices we have made thus far.

After checking your choices, click 'Create' to finish. Azure will then start deploying your Data Factory. (This usually takes around 3 or 4 minutes.) If it is successful, you will see something like the image below.

You have now created an Azure Data Factory. To open it, click on 'Go to resource'.