
Azure for Beginners - How to copy data from an Azure Storage Account to an Azure SQL Database
If you have been following along with our Azure for Beginners series, you have so far created an Azure Storage Account to which you uploaded a couple of CSV files, an Azure SQL Database, and an Azure Data Factory. What you have learned might have felt a bit disjointed, but in this part of the Azure for Beginners guide we will show you how all of these separate services work together. We will illustrate this by copying data from your Azure Storage Account to your Azure SQL Database through the use of your Azure Data Factory.
Note: The look of Azure tends to change regularly. Your screen might not look exactly like the images in this guide. Some features might have been moved to a different tab, but the general steps should be the same.
Note: the services in Azure are all paid services. The cost depends on your usage and configuration. (More information can be found here.) It's advised to do this under the supervision of a database administrator or a data engineer.
Azure Data Factory
After logging into your Azure account, navigate to Azure Data Factory. You can do so by either clicking on the Data factories icon in the shortcut tray on top, or, if it's not in there, by clicking on More services all the way to the right of the shortcut tray and scrolling down to the Analytics header, where you will find Data factories as well. Once the Data factories area has loaded, click on the Data Factory you created during the previous lesson.
After entering your Data Factory instance, you will be met by a screen full of settings for your Data Factory. For this beginner's guide, these aren't really of interest, because actually using Data Factory isn't done in this interface. The actual work is done in Azure Data Factory Studio, which is a web-based interface for working with Azure Data Factory. It will load in your browser and help you visually design data workflows, connect to data sources, transform and move data between different systems, and schedule and monitor your data operations. Launch Azure Data Factory Studio by clicking the blue button.
In Azure Data Factory Studio, you will see four buttons: Ingest, Orchestrate, Transform data, and Configure SSIS. The latter option is of no use to us in this Azure for Beginners series, since it's about migrating your existing SSIS (SQL Server Integration Services) data processes and running them in the cloud using Azure Data Factory. The first three buttons, however, are very useful. They more or less reflect the ETL process (Extract, Transform and Load), with Ingest corresponding to Extract, the Transform data step matching Transform, and Orchestrate encompassing both the Load phase and the overall management of the entire process. Since we are trying to get data from our Azure Storage Account into our Azure SQL Database, we will have to ingest data. Click on the Ingest button at the far left.
This will start the Copy Data tool, which will walk us through the necessary steps to ingest our data. We start off by having to choose between Built-in copy task and Metadata-driven copy task. The former is ideal for creating a quick data pipeline, which transfers data from one place to another. The latter is more advanced, as it can handle complex and large-scale data transfers. For our beginner's guide, we will choose 'Built-in copy task'.
Below it, we have the option to run the pipeline once, to schedule it to run every x hours/days/weeks/months (with or without a specific end date), or to create a tumbling window, which will run pipelines at specific intervals in a non-overlapping manner. Since this pipeline is only for demonstrative purposes, we will leave the task cadence on 'Run once now'. Click on 'next' to go to the next step.
Data Source
Next, we will have to select a type of data source. It is basically asking us where we have stored the data we want to copy to our Azure SQL Database. Seeing as the CSV file we want to copy to our database is stored as a blob in our Azure Storage Account, we will choose 'Azure Blob Storage'.
After having specified the source type, we will have to set up a connection to retrieve the CSV file from our Azure Storage Account. Click on 'New connection' to do so.
First, we will have to make up a name for the connection.
We then have to choose how the connection will be managed. This is done through an Integration Runtime (IR), the compute infrastructure Azure Data Factory uses to actually move your data. By default, this is done through Azure's own AutoResolveIntegrationRuntime, which will set up and manage the connection for you. It will determine how much compute is required within the specific data center, and will use it to perform the processing operations. You also have the choice to create a custom IR, which might suit your situation better if you have special requirements. For our beginner's guide, we will leave it on AutoResolveIntegrationRuntime and use the Account key as Authentication type.
You will have the option to select the storage account you have stored your CSV file in through your Azure subscription, or to enter it manually. If you opt for the latter, you will have to enter your storage account name, storage account key and an endpoint suffix. For this beginner's guide, we will use our Azure subscription. The rest of the settings can be left untouched. Click on the 'Test connection' button at the bottom to see if a connection can be established successfully. If so, click 'Create'.
Since we have just created a connection to our Azure Storage Account, we now have to specify the location of the specific file we want to copy to our Azure SQL Database. Click on 'Browse' to navigate to its location.
Select the file you want to copy to your Azure SQL Database. In our case, we will select shippers.csv. Click 'OK' when you're done.
We also have the options Binary copy, Recursively and Enable partitions discovery available.
• Binary copy - Your files will be copied as-is, which means that they will not be interpreted. Your files are transferred exactly as they are, without any conversion, transformation, or interpretation of the file format. The destination cannot be a relational database.
• Recursively - All the files in the input folder (and subfolders) will be processed recursively.
• Enable partitions discovery - This will add additional columns to your dataset based on the file path. So, if you have organized your files into separate folders, these folder names will be added to your dataset. E.g., if your dataset is located in a folder 'year=2024', a column will be added to your dataset with 'year' as header and '2024' as value(s).
We will leave this untouched, as we will do with the fields below it. Since we are copying one small CSV file, setting the Max concurrent connections won't be necessary. If, however, you are copying data from multiple sources to a single sink, you might need to adjust it so that the copy activity won't exceed the data store's maximum number of concurrent connections. Setting the start and end time also won't be necessary, as we want to copy the file right away. Click 'next'.
If all went well, Azure will have automatically detected the right file format. In our case, it's a CSV file, so the file format is 'DelimitedText'. If this hasn't been done automatically, you can either manually select the right file format or click on the 'Detect text format' button. You might also have to manually select the right column delimiter and row delimiter. If your source file is a CSV file with a header, make sure to select the 'First row as header' option. We won't select a Compression type (bzip2, gzip, deflate, ZipDeflate, TarGZip or tar), as our file isn't compressed. Click 'next'.
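To give you a feel for what 'DelimitedText' with 'First row as header' means, a file like ours could look something like the sketch below: the first line holds the column names, and every following line holds the values, separated by the column delimiter (a comma, in our case). Note that apart from shipperid, which we will encounter again later in this guide, the column names and values here are made up purely for illustration.

shipperid,companyname,phone
1,Speedy Shipping Co,555-0137
2,Oceanic Freight,555-0142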
Destination
Like we have done for the data source, we will have to choose a type and set up a connection for the destination as well. We basically have to specify where we want our CSV file to be copied to, and through what connection. For Destination type, select 'Azure SQL Database'. Then, click on 'New connection'.
Similar to setting up a connection to our Azure Storage Account, we will have to make up a name for our new connection to our Azure SQL Database and choose an Integration Runtime (IR). We will also specify a version for our connector. Some people have mentioned getting an error during the actual copying, which could be solved by choosing 'Legacy'. For now, we will choose 'Recommended', with the option of changing it to 'Legacy' if we run into an error.
Like before, we will have to specify how to select our database. In our case, we will do so through our subscription, after which we can select our server and our database.
Next, we will have to choose an Authentication type, which can be either a User Assigned Managed Identity or a System Assigned Managed Identity. Managed Identities are a secure way to authenticate with Azure services without having to mess around with credentials (usernames, passwords and such). Basically, an identity is created which is used to authenticate and to access the Azure SQL Database. A User Assigned Managed Identity is created independently of the Azure resource, so that it can be used with multiple Azure resources, and even across resource groups or Azure subscriptions. A System Assigned Managed Identity, on the other hand, is bound to a specific Azure resource. So, if it is tied to your Azure SQL Database, and you decide to delete your database, your identity will be deleted as well. For this beginner's guide, we will choose 'System Assigned Managed Identity'.
A Managed identity name will be generated and assigned to us, which you will see below it. You will need to copy it (which can be done by clicking on the icon next to it) and paste it into Notepad or something similar, because we will need it in a query in Azure Data Studio.
Azure Data Studio
Note: keep Azure Data Factory Studio open where you left it (which is on the screen above).
You will now have to start Azure Data Studio, which is the program you installed on your computer during the Azure for Beginners guide to creating an Azure SQL Database. If you haven't done so, please follow that part of the guide and start Azure Data Studio.
In Azure Data Studio, connect to the server you have created during the Azure SQL Server part of the beginner's guide. You can do so by right-clicking the server in the servers pane, and selecting 'Edit Connection'.
Like you did before, make sure you have entered the right server name. (If you forgot where you can find it, please scroll to the end of our Creating an Azure SQL Database guide.)
After you have successfully connected to your server, you will see a green circle next to the SQL Database icon.
We will now have to add ourselves as an owner of the database, which might not be entirely straightforward. To do so, we will have to create a new query.
To avoid running into an error, make sure you have the right database selected. This is the database name you made up during the Creating an Azure SQL Database part of the guide.
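If you want to double-check which database your query window is actually connected to, you can first run this one-liner; it should return your own database name rather than master:

SELECT DB_NAME();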
Copy-paste the following SQL code into the query field, and replace the 'your managed identity name' part with your actual managed identity name. This should be one word, without any spaces before, in or after it. Don't remove the square brackets. Click 'Run' in the top left corner after you have pasted the query and changed the Managed Identity name.
CREATE USER [your managed identity name] FROM EXTERNAL PROVIDER;
ALTER ROLE db_owner ADD MEMBER [your managed identity name];
In our case, it would look like the pictures below. In your case, 'yourdatafactoryname' will obviously be the name you have chosen.
The first line will create a new user in your Azure SQL Database which is associated with your managed identity name. The second line adds said user to the db_owner role, which grants full control over all aspects of the database, including the ability to perform any action such as creating tables, views, and stored procedures, and managing permissions. This SQL code makes sure that Azure Data Factory has the necessary access to copy the data in your CSV to your Azure SQL Database.
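If you want to confirm that the user was indeed created and added to the role, you can optionally run a query like the sketch below against the same database. It uses the built-in catalog views sys.database_principals and sys.database_role_members, and your managed identity name should show up in the member_name column:

SELECT dp.name AS member_name, dp.type_desc
FROM sys.database_role_members AS drm
JOIN sys.database_principals AS dp ON drm.member_principal_id = dp.principal_id
JOIN sys.database_principals AS r ON drm.role_principal_id = r.principal_id
WHERE r.name = 'db_owner';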
After the CREATE USER and ALTER ROLE statements have run successfully, you should see something like the image below. If you ran into a 'Cannot alter the role 'db_owner', because it does not exist or you do not have permission.' error, you might have selected the wrong database.
When your query has run successfully, you will see a message like this:
Now that we have added ourselves as db_owner in Azure SQL Database, keep Azure Data Studio open and connected to your server, but return to Azure Data Factory Studio, which is the one in your browser. We left off at the screen where you copied your Managed identity name (see image below), which is where we will pick up and finish the process of copying your file to your Azure SQL Database.
Below it, you will see a number of optional settings:
• Always Encrypted - a feature designed to protect sensitive data.
• Encrypt (mandatory or optional) - this basically decides whether the connection to your Azure SQL Database has to be encrypted through TLS (Transport Layer Security).
• Trust server certificate - when enabled, the server's certificate is accepted without being validated. Leaving it disabled means the certificate is verified, which protects against man-in-the-middle attacks.
• In the field below it, you can specify the hostname that should be used to validate the server's certificate, which can be different from the actual server name you're connecting to.
• Additional connection properties - these let you fine-tune the connection through additions like connectTimeout, which sets the waiting time before a connection times out.
• Annotations - these can add extra information to your connection configuration.
For our beginner's guide, we will leave all these settings untouched and go on to check if the connection to our Azure SQL Database works. To do so, click on 'Test connection' in the bottom right corner. You should see the Connection successful message appear above it. If so, click 'Create' to create the connection.
After having specified the Destination type and having created the Connection, we will now have to specify the destination in our Azure SQL Database. You will see two empty fields in which you will have to enter something. The field on the left is the schema, which is a container for (among other objects) your tables. This schema helps you organize and group related objects, like the folders on your computer. Like folders, schemas also come with certain rules. For instance, you cannot have two objects with the same name in a single schema. However, you can have two objects with the same name if they belong to different schemas. So, you can't have two tables called dbo.shipping, but you can have both dbo.shipping and etl.shipping.
In practice, schemas often reflect the stage the data is in. For example:
• It often starts with the raw schema, where the data is ingested and stored.
• It might move on to samp (sampling), where a subset of the data is created from the previous stage, to be used for initial testing in dev (Development and Testing).
• Another common schema is staging, where the data from raw or samp is processed further, after which it can be loaded into the etl schema, in which the data is transformed, cleaned and logged.
• Processed data is loaded into the dbo (Database Owner) schema, though some companies might prefer a different name, like prod or final.
• Finally, data might be stored in the archive schema, as historical data.
For our beginner's guide, we will use dbo as the schema, as it is the default schema in Azure SQL Database (and SQL Server). In the second field we will enter 'shippers', which will become the name of our table in Azure SQL Database. Click 'next' after you have chosen names for your schema and table.
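As an aside: you don't need to do this now, but if you ever want to create one of the schemas mentioned above yourself, a minimal T-SQL sketch (using the etl schema and the shipping tables from the example above) would look like this:

CREATE SCHEMA etl;
GO
CREATE TABLE dbo.shipping (shipperid INT); -- lives in the default dbo schema
CREATE TABLE etl.shipping (shipperid INT); -- same table name, different schema: allowed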
Next, we will have to configure column mappings, which is like a "matching" process between the CSV file in your Azure Storage Account and the (to be created) table in your Azure SQL Database. Seeing as we want to take the columns in our CSV file and copy them to a table in our Azure SQL Database, we basically have to tell Azure Data Factory how the columns in one should be connected to the columns in the other.
• When we take a look at the image below, we see two columns, each made up of three rows. The column on the left represents our CSV file, while the column on the right represents the table in our Azure SQL Database.
• In the column on the left, the field on top contains the value shipperid, which is the first column in our CSV. Next to it, we see the value String, which is its data type. (It basically means that the values in the shipperid column are treated as text.)
• On the same row, but in the column on the right, we also see the value shipperid, which means that the shipperid column from our CSV will be copied to a column in our Azure SQL Database with the same name.
• In the same manner, check if all other columns will be copied correctly from your CSV file to your Azure SQL Database.
During this process, you also have the option to only copy certain columns from your CSV file, or to change names and data types. However, for our beginner's guide, we will leave everything as-is.
Below it, you have a Pre-copy script input box. This field lets you write a script to clean up the data before copying in the new data. For our beginner's guide, we will leave this field empty.
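To give you an idea of what could go in here: a common pre-copy script empties the destination table first, so that re-running the pipeline later doesn't load the same rows twice. A minimal sketch, assuming our dbo.shippers table from the previous step, could look like this:

IF OBJECT_ID('dbo.shippers', 'U') IS NOT NULL -- only if the table already exists
    TRUNCATE TABLE dbo.shippers; -- empty it before the copy starts

The IF OBJECT_ID check makes sure the script doesn't fail on the very first run, when the table hasn't been created yet. Click on 'next' to continue to the Settings part.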
Settings
• First, we will have to make up a name for our copy data task. Optionally, you can add a description.
• Below it, you have the option to enable Data consistency verification, which will let Azure compare the data in your CSV file to the data in your Azure SQL Database. If there are inconsistencies found, you will have the option to abort or continue copying, based on your Fault tolerance selection below it.
• You are also able to enable logging, which logs your copied file names in a Copy activity. This can prove to be useful if you suddenly find unexpected files/tables in your destination, or if the files in your destination differ from the ones you copied from the source. You can then check the log to see if, perhaps, another application has made changes.
• Enabling staging allows you to first copy the data to an interim staging storage (Azure Blob storage or Azure Data Lake Storage Gen2), before copying it to the destination. This might prove useful if you want to ingest data from various sources, for security reasons (if you can't open certain ports in your firewall), or if you are copying data from an on-premises data storage to Azure over a slow network. (You can compress your data on-premises, move it to the staging data storage, where it can be decompressed and moved to the destination.)
• Under the Advanced heading, you will also be able to specify the Maximum data integration unit. A Data Integration Unit (DIU) is a measure of the compute power (CPU, memory, and network resources) used by Azure Data Factory to perform, among other activities, the copying of your data. Here you can specify the upper limit. The maximum number of DIUs depends on the source, destination, and size and number of files you're copying. For example, we are copying a CSV file from our Azure Storage Account (file store) to our Azure SQL Database (non-file store), which allows for 4 DIUs for copying a single file.
• We can also set the Degree of copy parallelism, which sets the number of parallel threads used to read from the source and write to the destination during the copying. When we increase it, Azure Data Factory spawns more parallel threads to perform the copying. These threads work simultaneously, reading chunks of data from the source and writing them to the destination.
For our beginner's guide, aside from our Task name, we will leave all these fields empty. Click 'next' to continue.
Summary
You will see an overview of the choices you have made so far. Check them and click 'next' to start the copying.
After Azure has completed the copying, you will see something like the image below.
When you switch to Azure Data Studio (the program, not the one in your browser) and refresh the tables by clicking on the icon to the right, your new table should be visible.
You will now be able to run your regular T-SQL queries to get the data you want.
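For example, a simple query like the one below should return the rows we just copied over from shippers.csv:

SELECT * FROM dbo.shippers;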
You have now successfully copied data from your Azure Storage Account to your Azure SQL Database. Well done! This concludes the How to copy data from an Azure Storage Account to an Azure SQL Database part of our Azure for Beginners guide.