
Azure for Beginners - Orchestrating Pipelines
In this installment of our Azure for Beginners series, we will build on what we created in the previous part of the guide (a data flow and a pipeline). We will create a new pipeline that ingests data from Google BigQuery and copies it to our Azure Storage Account, and from there hands it off to our existing pipeline, where we will join it with the data we already had.
Note: The look of Azure tends to change regularly. Your screen might not look exactly like the images in this guide. Some features might have been moved to a different tab, but the general steps should be the same.
Note: the services within Azure are all paid services. The cost depends on your usage and configuration. (More information can be found here.) It's advisable to do this under the supervision of a database administrator or a data engineer.
Google BigQuery - Uploading Our Data
For demonstration purposes, rather than ingesting a data set from our Azure Storage Account, we will use a data set stored in Google BigQuery, Google's data warehouse solution, and copy it to our Azure Storage Account as a CSV file.
In order to upload data to Google BigQuery, visit the Google Cloud Console website. Here, you will either have to log in with an existing account or create a new one. Once you have logged in, navigate to BigQuery Studio. Since we're only using a small data set, a free account will suffice, which means you won't have to enter your credit card info. (Your account will default to 'sandbox' mode, which will work fine.)
It's possible that a new project has been automatically created for you. If not, you will have to create one manually. This can be done in several ways, one of which is clicking 'Select a project' and then 'New project'.
Make up a name for your new project and click the 'create' button. Either write this project name down or copy-paste it somewhere, because you will need it later when we create the connection between BigQuery and Azure.
Next, we will have to create a new data set within our new project. Click on the three dots next to your new project and select 'Create dataset' from the menu.
Give your new data set a name and select the region(s) in which you want your data set to be located. Click the 'create dataset' button when you're finished.
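If you prefer working from code, the same dataset can be created with the google-cloud-bigquery Python package. This is only a minimal sketch; the project ID, dataset name, and region below are placeholders you should replace with your own values.

```python
# Minimal sketch: creating the BigQuery dataset programmatically.
# Requires: pip install google-cloud-bigquery (and being authenticated to Google Cloud).
from google.cloud import bigquery

client = bigquery.Client(project="my-azure-guide-project")        # placeholder project ID

dataset = bigquery.Dataset("my-azure-guide-project.orders_data")  # placeholder dataset name
dataset.location = "EU"                                           # the region you selected
client.create_dataset(dataset, exists_ok=True)
```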
Next, we will create a table which will contain the data from our orders_details.csv file. Click on the three dots next to the data set you have just created and select 'Create table'. (You might have to unfold the menu first, by clicking the triangle next to your new project.)
Now it's time to upload our data set to Google BigQuery.
• Select 'Upload' in the 'Create table from' field and click 'Browse' to navigate to the orders_details.csv file on your computer. The file format should automatically change to CSV.
• Under the Table header, make up a name for your new table. We have opted for the name 'orders_details'.
• If you want BigQuery to automatically detect the schema of your CSV file, be sure to select 'Auto detect' under the Schema header.
• Click the 'Create Table' button when you're finished.
If all went well, when you click on the little triangle next to your data set, your new table should be visible. (If not, refresh the page.) Select the table by clicking on it.
After selecting your new table, you should see more information about it. At the top, you can navigate between several tabs, each containing different information.
When we click the 'Preview' tab, the preview shows that our table contains the same data as our orders_details.csv file, which means that our upload has been successful.
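For completeness, the same upload can also be scripted with the google-cloud-bigquery package instead of using the web UI. This is a sketch under the assumption that your project is called my-azure-guide-project and your dataset orders_data (both placeholders); the options mirror the 'Auto detect' and header settings from the steps above.

```python
# Minimal sketch: loading orders_details.csv into BigQuery from Python.
from google.cloud import bigquery

client = bigquery.Client(project="my-azure-guide-project")   # placeholder project ID

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,   # the first row of orders_details.csv is the header
    autodetect=True,       # same as ticking 'Auto detect' under the Schema header
)

with open("orders_details.csv", "rb") as f:
    load_job = client.load_table_from_file(
        f, "my-azure-guide-project.orders_data.orders_details", job_config=job_config
    )
load_job.result()          # wait for the load job to finish

table = client.get_table("my-azure-guide-project.orders_data.orders_details")
print(f"Loaded {table.num_rows} rows into orders_details")
```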
Google BigQuery - Getting a Key
In order to get our data from Google BigQuery to Azure Data Factory, we will have to create a connection between the two. This can be done in several ways. One of the simplest is to download a JSON key file from Google Cloud and upload it to Azure Data Factory.
Expand the menu by clicking the hamburger button (the three horizontal lines) in the top left of your screen. Then navigate to 'IAM & Admin' and click on 'Service Accounts'.
Make sure your project is selected in the drop-down menu above and click the '+ Create Service Account' button.
Make up a name for your new service account and click 'Create and Continue'.
Optionally, you can select a role for this service account. For our beginner's guide, we will choose 'Basic' and 'Owner'.
Click 'Continue'.
Optionally, you can grant other users access to this service account, but for this beginner's guide we will skip it and click the 'Done' button.
If all went well, you should be able to see your new service account listed. Click on the blue link to see details on your service account.
Click the 'Keys' tab on top.
Next, click the 'Add Key' button and select 'Create new key'.
Under the Key type heading, select 'JSON', and click 'Create'. This will automatically download a JSON file to your computer. Make sure you know where it is downloaded, because later on you will need to upload the file to your Azure Data Factory.
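If you want to confirm that the key actually grants access before heading back to Azure, you can use it to run a quick query from Python. This is a minimal sketch; the key file name and dataset name are placeholders.

```python
# Minimal sketch: authenticating with the downloaded JSON key and querying the new table.
from google.cloud import bigquery
from google.oauth2 import service_account

creds = service_account.Credentials.from_service_account_file(
    "my-service-account-key.json"        # placeholder: the JSON key you just downloaded
)
client = bigquery.Client(credentials=creds, project=creds.project_id)

query = "SELECT COUNT(*) AS row_count FROM `orders_data.orders_details`"  # placeholder dataset name
for row in client.query(query).result():
    print(f"orders_details contains {row.row_count} rows")
```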
You should be able to see your key listed and active. Now, it's time to head back to Azure.
Log in to your Azure Data Factory
Next, we will switch back to Azure Data Factory. As you have done many times before, log in to your Azure account and navigate to Azure Data Factory. You can do this either by clicking the Data factories icon in the shortcut tray at the top or, if it's not there, by clicking More services at the far right of the shortcut tray and scrolling down to the Analytics header, where you will also find Data factories. Once the Data Factory environment has loaded, open the Data Factory you created in a previous lesson.
As you did in the previous lesson, launch Azure Data Factory Studio by clicking the 'Launch studio' button.
Instead of clicking on 'Transform data', this time we will click on 'Orchestrate'.
Orchestrating with Azure Data Factory
What exactly is orchestration in Azure Data Factory? In ADF, orchestration is all about instructing services within Azure what to do and when to do it. It comes down to coordinating and managing the execution of the different tasks or activities within a data pipeline. Orchestrating isn't about performing the actual activities (like transforming the data), but rather about instructing specific services to do so. It lets us make sure that everything is executed in the right order and at the right time, often involving complex control flow, scheduling, and dependency management.
As mentioned in the previous article, Azure Data Factory lets us connect to various data sources by creating an object called a 'Linked Service'. A Linked Service is essentially a set of configuration settings that defines the connection to our data source. Once we have defined a linked service to connect to, let's say, a file in our Azure Storage Account or a table in our SQL database, Azure Data Factory can access our data through a 'Dataset' object. A Dataset is a representation of the data structure within our storage account or database.
These Datasets can be used by a process within Azure Data Factory called an 'Activity', which represents an action you can perform on your data (e.g., transforming it).
In the previous part of our Azure for Beginners guide we created a so-called 'pipeline', which is essentially a logical grouping of Activities. By grouping Activities together into a pipeline, we can define the sequence in which these Activities should be performed, execute them on a schedule, set up triggers to initiate the process, etc., while still monitoring their execution as a single unit. For example, consider a scenario in which you need to extract data from a source, transform the data, and then load it into a SQL database. To accomplish this in Azure Data Factory, you would create a pipeline in which you use a Copy Data Activity to move the data to the cloud, followed by a Data Flow to perform the necessary transformations, and finally move your data to your SQL database through a sink or Copy Data Activity. Orchestration ensures that each of these steps happens in the correct order and under the right conditions.
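Behind the Studio UI, each of these objects is defined as JSON (you can inspect it via the curly-braces 'Code' button in the Studio). The sketch below, written as Python dicts for readability, shows how they reference each other; all names are placeholders and non-essential properties are omitted.

```python
# How ADF's building blocks reference each other (trimmed down, placeholder names).

linked_service = {                    # the connection to a data store
    "name": "MyBlobStorage",
    "properties": {"type": "AzureBlobStorage"},
}

dataset = {                           # a named view of data inside that store
    "name": "OrdersDetailsCsv",
    "properties": {
        "type": "DelimitedText",
        "linkedServiceName": {"referenceName": "MyBlobStorage", "type": "LinkedServiceReference"},
    },
}

pipeline = {                          # a logical grouping of Activities
    "name": "OrdersDetailsRetrievalPipeline",
    "properties": {
        "activities": [
            {
                "name": "CopyOrdersDetails",
                "type": "Copy",       # an Activity works on Datasets, not on the store directly
                "inputs":  [{"referenceName": "SomeSourceDataset", "type": "DatasetReference"}],
                "outputs": [{"referenceName": "OrdersDetailsCsv", "type": "DatasetReference"}],
            }
        ]
    },
}
```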
Connect your BigQuery table to Azure Data Factory
We will start off by creating a new pipeline.
Seeing as we will be retrieving a table from Google BigQuery, which is essentially just our orders_details.csv file uploaded to BigQuery, we will name our pipeline 'OrdersDetailsRetrievalPipeline'.
Next, we will drag a Copy Data Activity onto our canvas (also known as 'graph'), which we will use to copy data from our BigQuery account to Azure Data Factory.
With the Copy Data Activity selected, you will see the configuration settings appear below the canvas. If needed, you can also give the Activity a new name. (We have left it as is.) Click on the 'Source' tab.
To add a new source dataset, click on the + sign next to 'New'.
Select the Google BigQuery connector.
Make up a fitting name for your data set. Seeing as we're going to get the orders_details.csv file from BigQuery, we have named the data set 'BigQueryOrdersDetailsRetrieval'. Next, click on the caret to select a Linked Service and select '+ New' in the drop-down menu.
The next screen has quite a lot of fields. These are the most important ones:
• Under Name you will have to give your new linked service a name. We have opted for the name 'BigQueryOrdersDetails'.
• Next, we will have to specify how to connect to the table we have created in Google BigQuery. In our case, we will connect via 'AutoResolveIntegrationRuntime'.
• Under Project ID, fill in the project name you have used in BigQuery. (The one we have asked you to write down or copy-paste into notepad.) Make sure to remove leading and/or trailing spaces.
• Choose 'Service Authentication' as your Authentication type.
• Select Key file as the method by which you want to authenticate.
• Below it, we will have to upload the JSON key file we downloaded from Google Cloud. Click 'Choose file' to browse to the file on your computer and select it.
To check if the connection between your BigQuery table and Azure Data Factory is working as intended, click on 'Test connection' in the lower right corner. After testing, it should say 'Connection successful'. If so, click the 'Create' button.
If the connection is working, under the Table name header you will be able to select the table we have created in BigQuery. (It might take a little while for the name to become visible. If it doesn't, click the refresh button to the right.)
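For reference, the linked service you have just configured roughly corresponds to a JSON definition like the one sketched below (shown as a Python dict). The property names follow the Google BigQuery connector documentation but may differ between connector versions, and the uploaded key is stored as a secured value rather than in plain text, so treat this purely as an illustration.

```python
# Rough, illustrative sketch of the BigQuery linked service definition (not exact).
bigquery_linked_service = {
    "name": "BigQueryOrdersDetails",
    "properties": {
        "type": "GoogleBigQuery",
        "typeProperties": {
            "project": "my-azure-guide-project",           # your BigQuery project ID
            "authenticationType": "ServiceAuthentication",
            # the uploaded JSON key file is stored here as a secured value
        },
        "connectVia": {
            "referenceName": "AutoResolveIntegrationRuntime",
            "type": "IntegrationRuntimeReference",
        },
    },
}
```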
Next, we move on to the Sink tab. Click 'New' to create a new sink.
Select 'Azure Blob Storage'.
Since we want to save it as a CSV file, select 'DelimitedText'.
Enter a name, select a linked service and select the file path you want the CSV to be copied to. Don't forget to check 'First row as header'. Under the Import schema header, we will also select the 'From connection/store' option. Click 'OK' when you're finished.
Make sure to choose the .csv file extension if you are copying the data to a CSV file.
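Under the hood, the sink dataset you just created looks roughly like the sketch below (again as a Python dict, with placeholder names and paths). It expands on the earlier sketch with the location and header settings you selected.

```python
# Rough sketch of the sink dataset: a DelimitedText (CSV) dataset on Azure Blob Storage.
sink_dataset = {
    "name": "OrdersDetailsCsvSink",
    "properties": {
        "type": "DelimitedText",
        "linkedServiceName": {"referenceName": "MyBlobStorage", "type": "LinkedServiceReference"},
        "typeProperties": {
            "location": {
                "type": "AzureBlobStorageLocation",
                "container": "mycontainer",           # placeholder container name
                "fileName": "orders_details.csv",     # keep the .csv extension
            },
            "columnDelimiter": ",",
            "firstRowAsHeader": True,                 # matches the 'First row as header' checkmark
        },
    },
}
```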
For our purposes, we can skip the Mapping, Settings and User properties tabs.
Next, we will connect the pipeline we have made in a previous lesson to our new Copy Data Activity. On the Activity, click the blue button with the white arrow.
After clicking the blue button, a menu with Activities will appear. Scroll down and select 'Execute Pipeline'.
The 'Execute Pipeline' Activity will appear on the canvas, attached to your Copy Data Activity. Make sure to have it selected by clicking on the activity (so it turns blue).
With the Execute Pipeline Activity selected, you will see tabs with its settings below the canvas.
Next, navigate to the Settings tab. Here we need to select the pipeline we have created during the previous lesson.
If we go back to the canvas and select the Copy Data Activity, we see little squares attached to the Activity. These are conditional paths where we can connect Activities which will be executed based on the outcome of the current Activity:
• Grey arrow (upon skip) - this path will be executed if the current activity itself didn't run.
• Green checkmark (upon success) - this path will be executed if the current activity has succeeded. This is the default path.
• Red cross (upon failure) - this path will be executed if the current activity has failed.
• Blue arrow (upon completion) - this path will be executed after the current activity has completed, regardless of whether it succeeded or not.
These conditional paths allow us to use conditional logic. For example, when we look at the connection between our Copy Data Activity and our Execute Pipeline Activity, we see that they're connected through the green checkmark attached to the Copy Data Activity. This means that the Execute Pipeline Activity will only run if the Copy Data Activity has succeeded.
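In the pipeline's JSON, these paths show up as dependency conditions on the downstream Activity. The sketch below (with placeholder activity names) shows how our Execute Pipeline Activity depends on the Copy Data Activity, along with the values the other paths map to.

```python
# The Execute Pipeline Activity only runs when the Copy Data Activity succeeds.
execute_pipeline_dependency = {
    "dependsOn": [
        {"activity": "CopyOrdersDetails", "dependencyConditions": ["Succeeded"]}
    ]
}

# The four conditional paths map to these dependencyConditions values:
#   "Skipped"   - grey arrow: the upstream activity didn't run
#   "Succeeded" - green checkmark: the upstream activity succeeded (the default)
#   "Failed"    - red cross: the upstream activity failed
#   "Completed" - blue arrow: the upstream activity finished, regardless of outcome
```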
Since it would be nice to get a warning if the Copy Data Activity fails, we will attach an Activity to the red cross. With the Copy Data Activity selected, click the blue button with the white arrow.
Select the 'Fail' Activity from the menu.
Seeing as the upon success path (green checkmark) is the default path, the Fail Activity will most likely also be connected to this path rather than to the upon failure path (red cross).
If that's the case, delete the connection (Right-click + delete) and connect the Activities through the red cross path.
With the Fail Activity selected, you can also give the Activity a name.
On the Settings tab, we can enter the message we get upon failure, as well as an error code. Feel free to choose a message and error code which suits your situation best.
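Put together, the Fail Activity's definition ends up looking roughly like the sketch below (placeholder names; the message and error code are just examples): it hangs off the Copy Data Activity's upon-failure path and surfaces your custom message and error code.

```python
# Rough sketch of the Fail Activity attached to the 'Failed' path of the copy step.
fail_activity = {
    "name": "WarnOnCopyFailure",
    "type": "Fail",
    "typeProperties": {
        "message": "Copying orders_details from BigQuery failed.",  # example message
        "errorCode": "500",                                         # example error code
    },
    "dependsOn": [
        {"activity": "CopyOrdersDetails", "dependencyConditions": ["Failed"]}
    ],
}
```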
When you're finished, the canvas should look like this.
Next, validate the changes we have made and publish them.
Trigger the pipeline, so that it runs once.
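If you'd rather trigger the run from code than from the Studio, the azure-mgmt-datafactory Python package can start and monitor a pipeline run. This is a sketch under the assumption that you have the azure-identity and azure-mgmt-datafactory packages installed and permission on the factory; the subscription, resource group, and factory names are placeholders.

```python
# Minimal sketch: starting the pipeline run from Python and polling its status.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

subscription_id = "<your-subscription-id>"                 # placeholder
adf_client = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)

run = adf_client.pipelines.create_run(
    resource_group_name="my-resource-group",               # placeholder
    factory_name="my-data-factory",                        # placeholder
    pipeline_name="OrdersDetailsRetrievalPipeline",
)
print(f"Started pipeline run {run.run_id}")

# Check on the run, the same information the Monitor tab shows:
status = adf_client.pipeline_runs.get(
    "my-resource-group", "my-data-factory", run.run_id
)
print(status.status)   # e.g. 'InProgress', 'Succeeded', 'Failed'
```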
Altering the Previously Made Dataflow
In the pipeline we have just created, we have ingested data from Google BigQuery, copied it to our Azure Storage Account as orders_details.csv, and (upon success) executed the pipeline we made in the previous lesson. We will now take that data and join it with the data in the Dataflow we made previously, which is part of that existing pipeline. To do so, select the data flow in the sidebar to the left of the canvas.
Click the 'Add Source' button to add a source for the data set we have copied from BigQuery.
Give the source a name, select a source type and click the + sign to add a new data set.
Seeing as we have copied the data from BigQuery to our Azure Storage Account, we will select 'Azure Blob Storage'.
We have saved the data as a CSV file, which means we will choose 'DelimitedText' as the format type.
By now, you are probably somewhat familiar with these steps:
• Give the Dataset a name and select a linked service.
• Seeing as we have run the pipeline we just made (by using 'Trigger now'), there should now be a new CSV file saved in the blob storage of our Azure Storage Account. Navigate to it by clicking the folder icon under the File path header (to the right).
• Make sure to place a checkmark next to First row as header.
• Select 'From connection/store' under the Import schema header.
• Click the 'OK' button when you're finished.
The new source (in our case OrdersDetailsStream) should now be visible on the canvas. Next, we will join the data from our source (which contains the orders_details.csv data we have copied from BigQuery) to the joined data we already had in the Dataflow, which is the join between the OrdersStream (orders.csv) and the CustomersStream (customers.csv).
Click the + sign next to OrdersCustomersJoin.
In the menu which will appear, select 'Join'.
With the new join stream selected, below the canvas you'll have to select the settings for the join:
• Give the join a name.
• Optionally, you can enter a description.
• Select the left and right stream. We have chosen 'OrdersCustomersJoin' and 'OrdersDetailsStream' respectively.
• Choose a join type. We have chosen a left outer join, which means that all of the rows of the OrdersCustomersJoin will be retained, and only certain rows of the OrdersDetailsStream. The reason for this is that there might not be details on all of the orders.
• We have left 'fuzzy matching' unchecked. You can use this option if the values in the columns you want to base your join on (the join condition) might not match exactly. E.g., if you would like to match customer id '001' from your left stream to customer id '01' from your right stream, fuzzy matching will do so. When enabled, you can choose a similarity threshold, a percentage (60-100%) of how similar the values in the left and right stream need to be.
• Choose a join condition: the columns of the left and right stream that need to be in relation to each other. In our case, we will use the '==' operator, which means that we want the rows returned where the values in both columns match. Seeing as we are creating a left outer join, we will return all of the rows of the left stream and only the rows of the right stream where the values of the orderid columns match.
When we take a look at the Data preview tab, we see that the join works as intended.
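If you want a quick mental model of what this left outer join does, the same operation on a couple of toy rows in pandas looks like this (column names follow the guide's CSV files; the data itself is made up):

```python
# Toy illustration of the left outer join on orderid, using pandas.
import pandas as pd

orders_customers = pd.DataFrame({
    "orderid":    [1, 2, 3],
    "customerid": ["A", "B", "C"],
})
orders_details = pd.DataFrame({
    "orderid":  [1, 2],          # no details exist for order 3
    "quantity": [5, 2],
})

joined = orders_customers.merge(
    orders_details,
    how="left",                  # keep every row from the left stream
    on="orderid",                # the join condition: left orderid == right orderid
)
print(joined)
# Order 3 is kept, with NaN in the columns that come from orders_details.
```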
After the new join has been made, the canvas should look like this.
Although we have 28 columns in total (23 from the OrdersCustomersJoin we made in the previous lesson, plus 5 columns from our OrdersDetailsStream), only 26 are shown. The reason for this is that we have multiple columns with the same name, a result of the joins we have made.
With the OrdersCustomersToDB sink selected, navigate to the Mapping tab in the sink properties (below the canvas). As we saw in the previous lesson, input columns with identical names have an @ in their name. E.g., an input column like OrdersStream@orderid means that there is also an orderid column coming from another stream; in our case, somewhere in the list there will also be an OrdersDetailsStream@orderid. Dropping these duplicate columns wouldn't be a problem here, but if you would like to keep all of the columns, simply rename one of the duplicates. Do the same for the customerid columns.
This will leave you with all 28 columns.
Finally, validate and publish the changes we have made.
Run the Pipeline(s)
Now that we're finished building and altering the pipeline and Dataflow, it's time to run them.
In the sidebar to the left, select the pipeline we have created at the beginning of this lesson, which in our case is OrdersDetailsRetrievalPipeline.
Before we trigger the pipeline, make sure to have your Azure SQL Database active. In our case, this means connecting to Azure Data Studio. (For a how-to: check this part of the guide. Scroll down to the Azure Data Studio section.) Next, trigger the pipeline.
When we switch over to Azure Data Studio and refresh the tables (like we have done before), we can see that the columns from the orders_details.csv have been successfully joined with the other data.
With this, we have come to the end of this part of the Azure for Beginners guide. Great job!