Azure for beginners - Banner

Azure for Beginners - Orchestrating Pipelines

In this installment of our Azure for Beginners series, we will build on what we created in the previous part of the guide (in which we made a data flow and pipeline). This time, we will create a new pipeline that ingests data from Google BigQuery and copies it to our Azure Storage Account. From there, we will send it to our existing pipeline and join it with the data we already had in it.
Note: The look of Azure tends to change regularly. Your screen might not look exactly like the images in this guide. Some features might have been moved to a different tab, but the general steps should be the same.
Note: the services within Azure are all paid services. The cost depends on your usage and configuration. (More information can be found here.) It's advised to do this under the supervision of a database administrator or a data engineer.

Google BigQuery - Uploading Our Data

For demonstration purposes, rather than ingesting a data set that is already in our Azure Storage Account, we will use a data set saved in Google BigQuery, which is Google's data warehouse solution, and copy it to our Azure Storage Account as a CSV file.
In order to upload data to Google BigQuery, visit the Google Cloud Console website. Here, you will either have to log in with an existing account or create a new one. Once you have logged in, navigate to BigQuery Studio. Since we're only using a small data set, a free account will suffice. This means you won't have to enter your credit card information. (Your account will default to 'sandbox' mode, which will work fine.)
Azure for Beginners - Pipeline Orchestration - BQ - 0.0.1. - BigQuery Studio_1 SP
It's possible that a new project has been created for you automatically. If not, you will have to create a new project manually. This can be done in several ways, one of which is clicking 'Select a project' and then 'New Project'.
Azure for Beginners - Pipeline Orchestration - BQ - 0.1. - Select a project SP
Azure for Beginners - Pipeline Orchestration - BQ - 0.2. - New Project SP
Make up a name for your new project and click the 'Create' button. Either write this project name down for yourself or copy-paste it somewhere, along with the project ID shown beneath it, because you will need it later when we create the connection between BigQuery and Azure.
Azure for Beginners - Pipeline Orchestration - BQ - 0.2.1. - New Project SP
Next, we will have to create a new data set within our new project. Click on the three dots next to your new project and select 'Create dataset' from the menu.
Azure for Beginners - Pipeline Orchestration - BQ - 0.3. - Create Data set SP
Give your new data set a name and select the region(s) in which you want your data set to be located. Click the 'create dataset' button when you're finished.
Azure for Beginners - Pipeline Orchestration - BQ - 0.4 - Dataset Name SP
Next, we will create a table which will contain the data from our orders_details.csv file. Click on the three dots next to the data set you have just created and select 'Create table'. (You might have to unfold the menu first, by clicking the triangle next to your new project.)
Azure for Beginners - Pipeline Orchestration - BQ - 0.5 - Create Table SP
Now it's time to upload our data set to Google BigQuery.
• Select 'Upload' in the 'Create table from' field and click 'Browse' to navigate to the orders_details.csv file on your computer. The file format should automatically change to CSV.
• Under the Table header, make up a name for your new table. We have opted for the name 'orders_details'.
• If you want BigQuery to automatically detect the schema of your CSV file, be sure to select 'Auto detect' under the Schema header.
• Click the 'Create Table' button when you're finished.
Azure for Beginners - Pipeline Orchestration - BQ - 0.6.2 - Create Table settings SP
If all went well, when you click on the little triangle next to your data set, your new table should be visible. (If not, refresh the page.) Select the table by clicking on it.
Azure for Beginners - Pipeline Orchestration - BQ - 0.7.1 - Table Created SP
After selecting your new table, you should see more information about it. At the top, you will be able to navigate between several tabs, each containing different information.
Azure for Beginners - Pipeline Orchestration - BQ - 0.8. - Schema SP
When we click the 'Preview' tab, the preview of our table shows that it contains the same data as our orders_details.csv file, which means that our upload has been successful.
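If you prefer scripting over clicking, the same upload can also be done with Google's BigQuery Python client. A minimal sketch, assuming made-up project and dataset names (install with pip install google-cloud-bigquery and authenticate first, e.g. with gcloud auth application-default login):

from google.cloud import bigquery

client = bigquery.Client(project="my-bigquery-project")   # your project ID

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,   # the first row of orders_details.csv is the header
    autodetect=True,       # same as ticking 'Auto detect' under Schema
)

with open("orders_details.csv", "rb") as source_file:
    load_job = client.load_table_from_file(
        source_file,
        "my-bigquery-project.my_dataset.orders_details",   # project.dataset.table
        job_config=job_config,
    )

load_job.result()  # wait until the load job has finished
print(f"Loaded {load_job.output_rows} rows into orders_details.")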

Google BigQuery - Getting a Key

In order to get our data from Google BigQuery to Azure Data Factory, we will have to create a connection between the two. This can be done in several ways. One of the simplest is to download a JSON key file for a service account from Google Cloud and upload it to Azure Data Factory.
Expand the menu by clicking the hamburger button (the three horizontal lines) in the top left of your screen. Then navigate to 'IAM & Admin' and click on 'Service Accounts'.
Azure for Beginners - Pipeline Orchestration - BQ - Key - 1. - Service Accounts SP
Make sure your project is selected in the drop-down menu above and click the '+ Create Service Account' button.
Azure for Beginners - Pipeline Orchestration - BQ - Key - 1.1 - Create Service Account SP
Make up a name for your new service account and click 'Create and Continue'.
Azure for Beginners - Pipeline Orchestration - BQ - Key - 4. - Service Account Name SP
Optionally, you can select a role for this service account. For our beginner's guide, we will choose 'Basic' and 'Owner'.
Azure for Beginners - Pipeline Orchestration - BQ - Key - 5. - Role selection SP
Click 'Continue'.
Azure for Beginners - Pipeline Orchestration - BQ - Key - 10. - Continue SP
Optionally, you can grant other users access to this service account, but for this beginner's guide we will skip it and click the 'Done' button.
If all went well, you should be able to see your new service account listed. Click on the blue link to see details on your service account.
Azure for Beginners - Pipeline Orchestration - BQ - Key - 12. - Click your service account SP
Click the 'Keys' tab on top.
Azure for Beginners - Pipeline Orchestration - BQ - Key - 12.1 -Details tab SP
Next, click the 'Add Key' button and select 'Create new key'.
Azure for Beginners - Pipeline Orchestration - BQ - Key - 13 - Keys tab SP
Azure for Beginners - Pipeline Orchestration - BQ - Key - 14 - Create new key SP
Under the Key type heading, select 'JSON', and click 'Create'. This will automatically download a JSON file to your computer. Make sure you know where it is downloaded, because later on you will need to upload the file to your Azure Data Factory.
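Optionally, before heading back, you can verify that the key actually works. A small sketch using the BigQuery Python client; the key file name and project ID below are made up, so use your own:

from google.cloud import bigquery
from google.oauth2 import service_account

credentials = service_account.Credentials.from_service_account_file(
    "my-service-account-key.json"  # the JSON file you just downloaded
)
client = bigquery.Client(credentials=credentials, project="my-bigquery-project")

# List the datasets the service account can see; the data set we created earlier should appear.
for dataset in client.list_datasets():
    print(dataset.dataset_id)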
You should be able to see your key listed and active. Now, it's time to head back to Azure.
Azure for Beginners - Pipeline Orchestration - BQ - Key - 16 - JSON download SP

Log in to your Azure Data Factory

Next, we will switch back to Azure Data Factory. As you have done many times before, log in to your Azure account and navigate to Azure Data Factory. You can do this either by clicking on the Data factories icon in the shortcut tray at the top, or, if it's not in there, by clicking on More services all the way to the right of the shortcut tray and scrolling down to the Analytics header, where you will find Data factories as well. After the Data Factory environment has loaded, enter the Data Factory you created during a previous lesson.
As you did in the previous lesson, launch Azure Data Factory Studio by clicking the 'Launch studio' button.
Azure for beginners - Data Storage to SQL Database - 0.1 Start Azure Data Factory Studio SP button
Instead of clicking on 'Transform data', this time we will click on 'Orchestrate'.
Azure for Beginners - Pipeline Orchestration - 0.1.1 - Ingest data settings

Orchestrating with Azure Data Factory

What exactly is orchestration in Azure Data Factory? In ADF, orchestration is all about instructing services within Azure what to do and when to do it. Orchestration comes down to coordinating and managing the execution of different tasks or activities within a data pipeline. Orchestrating isn't about performing the actual activities (like transforming the data), but rather about instructing specific services to do so. It lets us make sure that everything is executed in the right order and at the right time, often involving complex control flow, scheduling, and dependency management.
As mentioned in the previous article, Azure Data Factory lets us connect to various data sources by creating an object called a 'Linked Service'. A Linked Service is essentially a set of configuration settings which defines the connection to our data source. Once we have defined a linked service to connect to, let's say, a file in our Azure Storage Account or a table in our SQL database, Azure Data Factory can access our data through a 'Dataset' object. A Dataset is a representation of the data structure within our storage account or database.
These Datasets can be used by a process within Azure Data Factory called an 'Activity'. Activities are the actions you can perform on your data (e.g., transforming it).
Azure for Beginners - Pipeline Orchestration - 0.3 - Activities settings
In the previous part of our Azure for Beginners guide we created a so-called 'pipeline', which is basically a logical grouping of Activities. By grouping Activities together into a pipeline, we can define the sequence in which these Activities should be performed, execute them on a schedule, set up triggers to initiate the process, etc., yet we can still monitor their execution as a single unit. For example, consider a scenario in which you need to extract data from a source, transform the data, and then load it into a SQL database. To accomplish this in Azure Data Factory, you would create a pipeline in which you would use a Copy Data Activity to move the data to the cloud, followed by a Data Flow to perform the necessary transformations, and finally, move your data to your SQL database through a sink or Copy Data Activity. Orchestration ensures that each of these steps happens in the correct order and under the right conditions.
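To make the relationship between these objects a little more tangible: Data Factory stores them as JSON definitions under the hood (you can peek at a pipeline's JSON via the {} button in the top right of the canvas). Below is a heavily trimmed, illustrative sketch written out as a Python dict; all names in it are made up.

pipeline_definition = {
    "name": "MyExamplePipeline",             # the pipeline: a grouping of Activities
    "properties": {
        "activities": [
            {
                "name": "CopyToCloud",       # an Activity: one action on your data
                "type": "Copy",
                "inputs": [
                    {"referenceName": "SourceDataset", "type": "DatasetReference"}
                ],
                "outputs": [
                    {"referenceName": "SinkDataset", "type": "DatasetReference"}
                ],
            },
            # ...followed by, for example, a Data Flow activity and another Copy activity.
        ]
    },
}
# Each Dataset, in turn, references a Linked Service, which holds the actual
# connection settings for the storage account, database, or other data store.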

Connect your BigQuery table to Azure Data Factory

We will start off by creating a new pipeline.
Azure for Beginners - Pipeline Orchestration - 1. Create new pipeline settings
Seeing as we will be retrieving a table from Google BigQuery, which is essentially just our orders_details.csv file uploaded to BigQuery, we will name our pipeline 'OrdersDetailsRetrievalPipeline'.
Azure for Beginners - Pipeline Orchestration - 1.1.1 Name your pipeline settings SP
Next, we will drag a Copy Data Activity onto our canvas (also known as the 'graph'), which we will use to copy data from our BigQuery account to Azure Data Factory.
Azure for Beginners - Pipeline Orchestration - 1.2.1 Copy Data settings
With the Copy Data Activity selected, you will see the configuration settings appear below the canvas. If needed, you can also give the Activity a new name. (We have left it as is.) Click on the 'Source' tab.
Azure for Beginners - Pipeline Orchestration - 1.3 Source settings
To add a new source dataset, click on the + sign next to 'New'.
Azure for Beginners - Pipeline Orchestration - 2. New source settings
Select the Google BigQuery connector.
Make up a fitting name for your data set. Seeing as we're going to get the orders_details.csv file from BigQuery, we have named the data set 'BigQueryOrdersDetailsRetrieval'. Next, click on the caret to select a Linked Service and select '+ New' in the drop-down menu.
Azure for Beginners - Pipeline Orchestration - 5.1 Name and Linked Service Settings SP
The next screen has quite a lot of fields. These are the most important ones:
• Under Name you will have to give your new linked service a name. We have opted for the name 'BigQueryOrdersDetails'.
• Next, we will have to specify how to connect to the table we have created in Google BigQuery. In our case, we will connect via 'AutoResolveIntegrationRuntime'.
• Under Project ID, fill in the ID of the project you created in BigQuery. (The one we asked you to write down or copy-paste into notepad; the snippet after the screenshot below shows how to read it straight from your key file if you're not sure.) Make sure to remove any leading and/or trailing spaces.
• Choose 'Service Authentication' as your Authentication type.
• Select Key file as the method by which you want to authenticate.
• Below that, we will have to upload the JSON key file we downloaded from BigQuery. Click 'Choose file' to browse to the file on your computer and select it.
Azure for Beginners - Pipeline Orchestration - 6. Adding BigQuery - total - marked 1.3 SP
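By the way, if you're not sure which value to fill in under Project ID, you don't have to guess: the service-account key file you downloaded is plain JSON and always contains a project_id field. A small Python sketch (adjust the file name to whatever your key file is called):

import json

# Read the service-account key file we downloaded in the previous section.
with open("my-service-account-key.json") as key_file:
    key = json.load(key_file)

print(key["project_id"])    # paste this value into the Project ID field
print(key["client_email"])  # the service account this key belongs to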
To check if the connection between your BigQuery table and Azure Data Factory is working as intended, click on 'Test connection' in the lower right corner. After testing, it should say 'Connection successful'. If so, click the 'Create' button.
Azure for Beginners - Pipeline Orchestration - 7. Testing Connection SP
If the connection is working, under the Table name header you will be able to select the table we have created in BigQuery. (It might take a little while for the name to become visible. If it doesn't, click the refresh button to the right.)
Azure for Beginners - Pipeline Orchestration - 9. Select table name SP
Next, we move on to the Sink tab. Click 'New' to create a new sink.
Azure for Beginners - Pipeline Orchestration - 11. Sink selection SP
Select 'Azure Blob Storage'.
Azure for Beginners - Pipeline Orchestration - 13. Azure Blob Storage SP
Since we want to save it as a CSV file, select 'DelimitedText'.
Enter a name, select a linked service and select the file path you want the CSV to be copied to. Don't forget to tick the 'First row as header' checkbox. Under the Import schema header, we will also select the 'From connection/store' option. Click 'OK' when you're finished.
Azure for Beginners - Pipeline Orchestration - 18.1 Sink properties SP
Make sure to choose the .csv file extension if you are copying the data to a CSV file.
Azure for Beginners - Pipeline Orchestration - 19. csv file extension SP
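Behind the scenes, this sink dataset is stored as a small JSON definition too. A trimmed sketch of what it roughly looks like, written out as a Python dict; the dataset, linked service and container names are made up, while orders_details.csv is the file we're copying:

sink_dataset = {
    "name": "OrdersDetailsCsv",                             # hypothetical dataset name
    "properties": {
        "type": "DelimitedText",                            # the CSV format we picked
        "linkedServiceName": {
            "referenceName": "MyBlobStorageLinkedService",  # hypothetical linked service
            "type": "LinkedServiceReference",
        },
        "typeProperties": {
            "location": {
                "type": "AzureBlobStorageLocation",
                "container": "mycontainer",                 # hypothetical container
                "fileName": "orders_details.csv",           # note the .csv extension
            },
            "columnDelimiter": ",",
            "firstRowAsHeader": True,                       # the checkbox we ticked
        },
    },
}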
For our purposes, we can skip the Mapping, Settings and User properties tabs.
Next, we will connect the pipeline we have made in a previous lesson to our new Copy Data Activity. On the Activity, click the blue button with the white arrow.
Azure for Beginners - Pipeline Orchestration - 23.1.0 Add step SP
After clicking the blue button, a menu with Activities will appear. Scroll down and select 'Execute Pipeline'.
Azure for Beginners - Pipeline Orchestration - 25. Execute pipeline SP
The 'Execute Pipeline' Activity will appear on the canvas, attached to your Copy Data Activity. Make sure to have it selected by clicking on the activity (so it turns blue).
With the Execute Pipeline Activity selected, you will see tabs with its settings below the canvas.
Azure for Beginners - Pipeline Orchestration - 26. General tab SP
Next, navigate to the Settings tab. Here we need to select the pipeline we have created during the previous lesson.
If we go back to the canvas and select the Copy Data Activity, we see little squares attached to the Activity. These are conditional paths where we can connect Activities which will be executed based on the outcome of the current Activity:
Grey arrow (upon skip) - this path will be executed if the current activity itself didn't run.
Green checkmark (upon success) - this path will be executed if the current activity has succeeded. This is the default path.
Red cross (upon failure) - this path will be executed if the current activity has failed.
Blue arrow (upon completion) - this path will be executed after the current activity has completed, regardless of whether it has succeeded or not.
These conditional paths allow us to use conditional logic. For example, when we look at the connection between our Copy Data Activity and our Execute Pipeline Activity, we see that they're connected through the green checkmark attached to the Copy Data Activity. This means that the Execute Pipeline Activity will only be run if the Copy Data Activity has succeeded.
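If you're curious what this looks like under the hood, the conditional paths end up as dependency conditions in the pipeline's JSON definition. A rough, trimmed sketch written as a Python dict, using the default Activity names ADF tends to assign (yours may differ):

execute_pipeline_activity = {
    "name": "Execute Pipeline1",                   # default name ADF gives the Activity
    "type": "ExecutePipeline",
    "dependsOn": [
        {
            "activity": "Copy data1",              # the upstream Copy Data Activity
            "dependencyConditions": ["Succeeded"]  # green checkmark; alternatives are
                                                   # "Failed", "Skipped" and "Completed"
        }
    ],
    "typeProperties": {
        "pipeline": {
            "referenceName": "<the pipeline from the previous lesson>",
            "type": "PipelineReference",
        },
        "waitOnCompletion": True,
    },
}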
Since it would be nice to get a warning if the Copy Data Activity fails, we will attach an Activity to the red cross. With the Copy Data Activity selected, click the blue button with the white arrow.
Azure for Beginners - Pipeline Orchestration - 29. Add fail SP
Select the 'Fail' Activity from the menu.
Azure for Beginners - Pipeline Orchestration - 30. Fail SP
Seeing as the upon success path (green checkmark) is the default path, the Fail Activity will most likely also be connected to this path rather than to the upon failure path (red cross).
Azure for Beginners - Pipeline Orchestration - 31. Fail - wrongly connected SP
If that's the case, delete the connection (Right-click + delete) and connect the Activities through the red cross path.
Azure for Beginners - Pipeline Orchestration - 32. Fail - correctly connected SP
With the Fail Activity selected, you can also give the Activity a name.
Azure for Beginners - Pipeline Orchestration - 33. General tab SP
On the Settings tab, we can enter the message we get upon failure, as well as an error code. Feel free to choose a message and error code which suit your situation best.
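Under the hood, these settings end up in the Fail Activity's JSON definition. A rough sketch as a Python dict; the name, message and error code below are just examples:

fail_activity_settings = {
    "name": "FailOnCopyError",   # hypothetical name; use whatever you entered
    "type": "Fail",
    "typeProperties": {
        "message": "Copying orders_details from BigQuery failed.",  # example message
        "errorCode": "500",                                         # example error code
    },
}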
When you're finished, the canvas should look like this.
Azure for Beginners - Pipeline Orchestration - 35. Overview SP
Next, validate the changes we have made and publish them.
Azure for Beginners - Pipeline Orchestration - 35. Validate and Publish SP
Trigger the pipeline, so that it runs once.
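For completeness: a one-off run ('Trigger now') can also be started from code with the azure-mgmt-datafactory Python SDK. A minimal sketch, assuming made-up subscription, resource group and factory names (pip install azure-identity azure-mgmt-datafactory):

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

adf_client = DataFactoryManagementClient(
    DefaultAzureCredential(), "<your-subscription-id>"
)

# Kick off a single run of the pipeline (the equivalent of 'Trigger now').
run = adf_client.pipelines.create_run(
    "my-resource-group",                # hypothetical resource group name
    "my-data-factory",                  # hypothetical Data Factory name
    "OrdersDetailsRetrievalPipeline",   # the pipeline we just built
)

# Check on the run afterwards.
pipeline_run = adf_client.pipeline_runs.get(
    "my-resource-group", "my-data-factory", run.run_id
)
print(pipeline_run.status)  # e.g. 'InProgress', 'Succeeded' or 'Failed'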

Altering the Previously Made Dataflow

In the pipeline we have just created, we ingest data from Google BigQuery, copy it to our Azure Storage Account as orders_details.csv, and (upon success) execute the pipeline we made in the previous lesson. We will now take that data and join it with the data in the Dataflow we made previously, which is part of that earlier pipeline. To do so, select the data flow in the sidebar to the left of the canvas.
Azure for Beginners - Pipeline Orchestration - 39. Select dataflow SP
Click the 'Add Source' button to add a source for the data set we have copied from BigQuery.
Give the source a name, select a source type and click the + sign to add a new data set.
Azure for Beginners - Pipeline Orchestration - 40. OrdersDetailsStream SP
Seeing as we have copied the data from BigQuery to our Azure Storage Account, we will select 'Azure Blob Storage'.
Azure for Beginners - Pipeline Orchestration - 13. Azure Blob Storage SP
We have saved the data as a CSV file, which means we will choose 'DelimitedText' as the format type.
By now, you are probably somewhat familiar with these steps:
• Give the Dataset a name and select a linked service.
• Seeing as we have run the pipeline we just made (by using 'Trigger now'), there should now be a new CSV file saved in the blob storage of our Azure Storage Account. Navigate to it by clicking the folder icon under the File path header (to the right).
• Make sure to place a checkmark next to First row as header.
• Select 'From connection/store' under the Import schema header.
• Click the 'OK' button when you're finished.
Azure for Beginners - Pipeline Orchestration - 43. OrdersDetails properties SP
Azure for Beginners - Pipeline Orchestration - 44.1 Source settings SP
The new source (in our case OrdersDetailsStream) should now be visible on the canvas. Next, we will join the data from our source (which contains the orders_details.csv data we have copied from BigQuery) to the joined data we already had in the Dataflow, which is the join between the OrdersStream (orders.csv) and the CustomersStream (customers.csv).
Click the + sign next to OrdersCustomersJoin.
In the menu which will appear, select 'Join'.
Azure for Beginners - Pipeline Orchestration - 45 - Select JOIN SP
With the new join stream selected, below the canvas you'll have to select the settings for the join:
• Give the join a name.
• Optionally, you can enter a description.
• Select the left and right stream. We have chosen 'OrdersCustomersJoin' and 'OrdersDetailsStream' respectively.
• Choose a join type. We have chosen a left outer join, which means that all of the rows of the OrdersCustomersJoin will be retained, and only the matching rows of the OrdersDetailsStream. The reason for this is that there might not be details on all of the orders.
• We have left 'fuzzy matching' unchecked. If the values in the columns you want to base your join on (the join condition) might not match for a full 100%, you can use this option. E.g., if you would like to match customer id '001' from your left stream to customer id '01' from your right stream, fuzzy matching will do so. When enabled, you can choose a similarity threshold, which is a percentage (60-100%) of how similar the values in the left and right stream need to be.
• Choose a join condition: the columns of the left and right stream which need to be in relation to each other. In our case, we will use the '==' operator. This means that we want the rows returned where the values in both columns match. Seeing as we are creating a left outer join, we want to return all of the rows of the left stream and only the rows of the right stream where the values of the orderid columns match.
Azure for Beginners - Pipeline Orchestration - 45.1 Join settings SP
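To make the left outer join behaviour concrete, here is a tiny, made-up example in pandas. Only the orderid join key mirrors our data; the other columns and values are invented for illustration:

import pandas as pd

# Two toy tables standing in for the streams we're joining.
orders_customers = pd.DataFrame(
    {"orderid": [1, 2, 3], "customername": ["Ann", "Bo", "Cy"]}
)
orders_details = pd.DataFrame(
    {"orderid": [1, 2], "discount": [0.10, 0.00]}
)

# A left outer join on orderid: every row of the left table is kept, and
# matching rows of the right table are attached where the orderids are equal.
joined = orders_customers.merge(orders_details, on="orderid", how="left")
print(joined)
# Order 3 has no matching details row, so it is kept with an empty (NaN) discount.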
When we take a look at the Data preview tab, we see that the join works as intended.
Azure for Beginners - Pipeline Orchestration - 47. Data preview SP
After the new join has been made, the canvas should look like this.
Although we have 28 columns in total (23 from the OrdersCustomersJoin we have made in the previous lesson, to which we have added 5 columns from our OrdersDetailsStream), it only shows 26 columns. The reason for this is that we have multiple columns with the same name, which is a result of the joins we have made.
Azure for Beginners - Pipeline Orchestration - 49.1 Before mapping SP
With the OrdersCustomersToDB sink selected, navigate to the Mapping tab in the sink properties (below the canvas). As we have seen in the previous lesson, input columns with identical names will have an @ in them. E.g., when we see an input column like OrdersStream@orderid, this means that there will also be an orderid column coming from another stream; in our case, somewhere in the list there will also be OrdersDetailsStream@orderid. In this case, losing these duplicate columns wouldn't be a problem, but if you would like to keep all of the columns, simply rename one of the duplicates. Do the same for the customerid columns.
Azure for Beginners - Pipeline Orchestration - 49.1.1 Before mapping SP
Azure for Beginners - Pipeline Orchestration - 50.1 After mapping - column view SP
This will leave you with all 28 columns.
Azure for Beginners - Pipeline Orchestration - 50.1 After mapping SP
Finally, validate and publish the changes we have made.
Azure for Beginners - Pipeline Orchestration - 35. Validate and Publish SP

Run the Pipeline(s)

Now that we're finished building and altering the pipeline and Dataflow, it's time to run them.
In the sidebar to the left, select the pipeline we have created at the beginning of this lesson, which in our case is OrdersDetailsRetrievalPipeline.
Azure for Beginners - Pipeline Orchestration - 52. Select pipeline SP
Before we trigger the pipeline, make sure your Azure SQL Database is active. In our case, this means connecting to it through Azure Data Studio. (For a how-to: check this part of the guide. Scroll down to the Azure Data Studio section.) Next, trigger the pipeline.
Azure for Beginners - Pipeline Orchestration - 51. Running pipeline SP
When we switch over to Azure Data Studio and refresh the tables (like we have done before), we can see that the columns from the orders_details.csv have been successfully joined with the other data.
Azure for Beginners - Pipeline Orchestration - 53. (9.1) Discount success SP
With this, we have come to the end of this part of the Azure for Beginners guide. Great job!