
Azure for Beginners - Creating a Data Flow & Pipeline in Azure Data Factory
In our Azure for Beginners series, so far, we have created an Azure Storage Account, an Azure SQL Database, an Azure Data Factory, and have shown how to move data using all three. However, we have yet to explore the full potential of Azure Data Factory. Where ADF really shines is in creating data flows and pipelines.
• Data flows allow us to transform data through a visual interface, without necessarily having to write code. They use a drag-and-drop system, which makes it easier to design and visualize complex data transformation logic. For example, if we want to join, aggregate, or filter our data, we will use a data flow. A data flow cannot run by itself; it needs a pipeline to run.
• These data flows can be integrated into pipelines, which are logical groupings of activities that, together, perform a task. Where data flows are centered around transforming data, pipelines are focused on orchestration. They allow us to manage activities (like copy data activities or data flow activities) as a set instead of each one individually and independently. For example, if you have separate activities which ingest, clean, and analyze your data, these can be connected together in a pipeline and triggered one after the other. Since there are many activities in Azure Data Factory, this means a pipeline can run without a data flow, but a data flow cannot run without a pipeline.
In this part of our Azure for Beginners series, we will create a data flow which will take two of our CSV files in our Azure Storage Account, join them together, filter the joined data, and write the data into our Azure SQL Database. We will then create a pipeline in which we will run this data flow.
Note: The look of Azure tends to change regularly. Your screen might not look exactly like the images in this guide. Some features might have been moved to a different tab, but the general steps should be the same.
Note: The services within Azure are all paid services. The cost depends on your usage and configuration. (More information can be found here.) It's advised to do this under the supervision of a database administrator or a data engineer.
Azure Data Factory
As you have done before, log into your Azure account and navigate to Azure Data Factory. You can do this either by clicking on the Data factories icon in the shortcut tray on top, or, if it's not in there, by clicking on More services all the way to the right of the shortcut tray and scrolling down to the Analytics header, where you will also find Data factories. Once the Data Factory area has loaded, click on the Data Factory you created during a previous lesson.
Like you have done in the previous lesson, launch Azure Data Factory Studio by clicking the 'Launch studio' button.
Instead of clicking on 'Ingest', this time we will click on 'Transform data'.
New Data Source
You will be met with a new interface. In here, you will be able to create, among other things, data flows and pipelines. Our goal is to take two data sets, join them, filter them, and move them to our database. It's probably no surprise, but things can be done in several ways in Azure. For example, since we need to select the data we want to join first, we could begin by directly adding the data sets under the Datasets header. (Hover over the 0 and three dots will appear. Click on them, and click on 'New dataset'.) However, we will instead create a new data flow and add the data sets in the process.
To begin, we will first give our new data flow a descriptive name.
Since we will transform our orders.csv data set in this data flow, we will call it 'OrdersTransformation'.
Next, we will want to add our orders.csv data set. Start by clicking on 'Add Source'.
This will create a new source, which we will point towards our orders.csv data set.
With this source selected (which will give it a blue color), we will need to configure it on the lower side of the screen. As is common in Azure, this will lead us through several tabs, which can be seen on top.
On the Source settings tab, make up a name for the source. In our example, we have called it 'OrdersStream'. For Source type, we have two options:
• Dataset - By selecting 'Dataset' in Azure Data Factory, we will create an object which represents the data structure we will use. (We aren't loading the actual data, but rather the Dataset object which represents the data.) In our case, this data structure is a CSV file in our Azure Storage Account. This object is reusable, so if we want to reference the same CSV file in our storage account again, we could reuse this Dataset. For reusable, complex, or frequently accessed data sources, Dataset is the better option.
• Inline - The 'Inline' source type, on the other hand, can only be used once. It doesn't create a reusable object, but rather defines the data source details within this activity. For simple, one-off situations, inline data sources can be a great option.
We will select 'Dataset' as Source type and click on '+ New' to select our data set.
Seeing as our CSV file is stored in our Azure Storage Account, click the button 'Azure Blob Storage'.
Select the correct format of the data set you want to use. In our case it is a CSV file, which is a 'DelimitedText'.
First, name the new data source. Below it you will have to select a Linked service, which might require a small explanation: a linked service holds the connection information (comparable to a connection string) that Data Factory uses to reach an external resource, in our case our Azure Storage Account.
Then, click on the folder icon to navigate to the data set you want to use.
In our case, we will use the orders.csv file we have uploaded during the creation of our Azure Storage Account.
If you are using a CSV file with a header as data source, be sure to enable the 'First row as header' feature. Below it, you'll have to select how you want to import the schema for the data source. In our case, this will be done from the connection/store, which means Azure will import the schema directly from the data source we have selected. Alternatively, you might want to use your own sample file containing a schema, or import no schema and set it manually later on. Click 'OK' when you're done.
Click the Source options tab on top. This tab shows us several possibilities for ingesting the file. These options depend on the file type of your data source, which in our case is a CSV file. Although we will leave all of the options in their default state for this beginner's guide, here is a short description of these options:
• Wildcard paths are used when you want to select multiple files of the same type. E.g., let's say our folder contains files like orders1.csv, orders2.csv, etc. This folder also contains many other files, using other names. If we want to select only files starting with 'orders', we could use a wildcard path like orders?.csv, which will fetch only our orders files. If we had many orders files, going into the double digits, we would have to add an extra '?' for every extra digit. E.g., to match orders123.csv we would have to use the wildcard orders???.csv. If you don't want to set an upper boundary, you can use the wildcard orders*.csv. (A short Python sketch of these pattern semantics follows after this list.)
• Partitions root path has to do with partition discovery. By specifying partitions root path, you will generate additional columns in your data set based on your file path. Let's say your data set is stored in these folders:
◦ root/yournewcontainer/year=2024/month=12/day=01
If you specify Partitions root path to be:
◦ root/yournewcontainer/year=2024
Azure Data Factory will generate two additional columns in your data set:
◦ The column month, containing the value '12'.
◦ The column day, containing the value '01'.
• Allow no files found - When enabled, you will not get an error if no files are found. This is a great option if you want to create a pipeline in which you do not expect your upstream(s) to send you files every day, but don't want your pipelines failing.
• List of files - If you want to select specific files (but not all files), you can use this feature. The difference with Wildcard paths is that List of files doesn't require the filenames to share a certain pattern. Where Wildcard paths require orders1.csv and orders2.csv to share one or more parts of the filename, in this case the 'orders' and '.csv' parts, List of files doesn't have this requirement. This means that you wouldn't be able to select the files abcd.csv and efghijk.json out of multiple files with Wildcard paths, but you could with List of files. When enabled, you will have to specify the path to this list, in which you specify all the files you want selected. This can simply be a text file with every filename on a separate line.
• Multiline rows - If your source file makes use of multiline rows, where rows are spread out over more than one line, select this feature to make sure Azure Data Factory interprets these rows as single rows rather than multiple rows. Multiline values must be in quotes.
• Maximum columns - Here you can set the maximum number of columns, which defaults to 20480. If you have more columns than the default, you should specify the higher number here.
• Change data capture - If you want only the changes in your source data set moved to your destination data set, you can enable Change data capture (CDC). This feature will only process the source data that has changed since the last time the pipeline ran, rather than the entire dataset on each run. In data terms, these changes (like new rows being added, existing rows being updated, or rows being deleted) are also known as the 'delta'.
• Column to store file name - This will create a new column in your destination data set with the source file name and path.
• After completion will either delete or move your source data set(s), or do nothing. Enabling one of the former two options will have you specify the relevant paths.
• Filter by last modified lets you filter your files based upon when they were last altered, for example if you only want to select files which were last modified during a specific period of time. You will have to specify the start and end of this period.
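As an aside for readers who like to see the pattern semantics in code, here is a minimal Python sketch using the standard fnmatch module, whose '?' (exactly one character) and '*' (any number of characters) behave like the wildcard paths described above. The file names are made up for the example; Data Factory does the matching for you, so this is purely illustrative.

```python
from fnmatch import fnmatch

# Hypothetical file names in a container; only the pattern semantics matter here.
files = ["orders1.csv", "orders2.csv", "orders123.csv", "customers.csv", "notes.txt"]

# 'orders?.csv'   -> '?' matches exactly one character: orders1.csv, orders2.csv
# 'orders???.csv' -> exactly three characters after 'orders': orders123.csv
# 'orders*.csv'   -> '*' matches any number of characters: all three orders files
for pattern in ["orders?.csv", "orders???.csv", "orders*.csv"]:
    matched = [name for name in files if fnmatch(name, pattern)]
    print(pattern, "->", matched)
```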
On the Projection tab, we can define and manage the schema (i.e., the structure) of the data coming out of this source. Here you will be able to set the data type for your columns (e.g., string, integer), as well as the format of each column.
The Optimize tab in Azure Data Factory lets you choose whether or not you want to repartition your data after transforming it. Here you will have a choice between:
• Use current partitioning - This is the default option and will let Azure Data Factory keep the current output partitioning of the transformation. This option is recommended in most scenarios.
• Single partition - This will combine all the distributed data into a single partition. Since all data ends up in one single, big partition, there won't be any parallel processing, which makes the process very slow and negatively impacts all downstream activities. If you don't have an explicit business reason to use it, it's best avoided.
• Set partitioning - You will be able to choose from these partitioning types:
◦ Round robin, which will distribute your data equally, based on the number of physical partitions you've set. This is a good option if you don't have a good partitioning strategy.
◦ Hash, which produces a hash of columns to have rows with similar values fall in the same partition. You can set the number of physical partitions.
◦ Dynamic range, which uses Spark dynamic ranges based on the columns or expressions that you provide. You can set the number of physical partitions.
◦ Fixed range, which will have you build an expression that provides a fixed range for values within your partitioned data columns. You can set the number of physical partitions.
◦ Key, which will create partitions for each unique value in the column you have specified. The number of partitions is based on the number of unique values in the column you've chosen. For example, let's say you have a data set on your employees, and you have chosen the column department, which contains the departments your employees belong to, as the Key. This option will create a new folder called 'department=....' (with .... being the department name) for every department in that column. This folder will contain the partition with the employee data for that specific department only. (A short PySpark sketch of this idea follows after this list.)
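Because data flows run on Apache Spark under the hood, the Key and Hash options map loosely onto Spark's own partitioning. Below is a hedged PySpark sketch (the data, path, and app name are made up for illustration) of what those two choices roughly correspond to; Data Factory configures this for you, so you never write this code yourself.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-demo").getOrCreate()

# Hypothetical employee data; 'department' plays the role of the Key column.
employees = spark.createDataFrame(
    [("Alice", "Sales"), ("Bob", "Sales"), ("Carol", "IT")],
    ["name", "department"],
)

# Key-style partitioning: one folder per unique value, e.g. department=Sales/, department=IT/.
employees.write.mode("overwrite").partitionBy("department").parquet("/tmp/employees")

# Hash-style partitioning: rows with the same department hash into the same physical partition.
hashed = employees.repartition(4, "department")
print(hashed.rdd.getNumPartitions())  # 4 physical partitions
```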
For our beginner's guide, we will leave it on 'Use current partitioning', and move on to the Inspect tab.
On the Inspect tab, we can see a read-only view of the metadata of the data we are transforming. It's a quick overview of columns, the column count, the columns we have changed or added, data types, column order, and column preferences. Since we haven't made changes to the data, we currently only see the Total column count right under the tabs. When you make changes, you will also see counts like New, Updated, and Unchanged.
Next, we will move on to the Data preview tab.
It's always a good thing to check the Data preview tab, even if it's just to check if the data has been loaded correctly. If you see the message "Please turn on debug mode and wait until the cluster is ready to preview data", you will first have to turn on the debug mode. This can be done by flipping the Data flow debug switch.
Select the right integration runtime and how long you want the debug session to stay active (the debug time to live). For our beginner's guide, we will leave the settings untouched and click 'OK'.
Activating Debug mode might take a while.
Click Refresh in the top left corner of the Data preview tab. If everything went well, you should be able to see a data preview of our data.
Add another data source
Next, we're going to add another data source.
This is done in the same way we have added the previous data source. We have added another CSV file, called 'Customers', which is why we have called the stream 'CustomersStream'. No adjustments have been made on the rest of the tabs.
Creating a join in your data flow
We will now join the two data sets we have added into one data set. If you are familiar with joining tables in SQL, Python or visualization programs like Tableau, Power BI or Looker Studio, joining tables is probably nothing new to you. If not, it's quite straightforward: you take two tables and combine them into one (wider and possibly longer) table. This lets you combine data from multiple data sources.
So far, we have added our orders.csv data set (OrdersStream), which is basically a table with the orders placed by customers, and we have added our customers.csv data set (CustomersStream), which is a table with our customers' info. We want to join these two data sets into one. This will give us a single data set with the orders and the customers who have placed these orders. In many cases (but not always), a join is done through a column both tables have in common, which in our case is the 'customerid' column. In our customers data set, the customerid column uniquely identifies each customer. (No two customers will have the same customerid.) In data terms, this column is the so-called primary key. In the orders data set, the customerid column is not unique: customers might place several orders, which will make their customerid appear multiple times. Instead, the customerid column in the orders data set refers to the primary key in the customers data set; this makes it a foreign key. In many cases (but not always), joins are made on primary and foreign keys.
To create a join, click on the + sign next to one of your data streams.
This will give you a list of all kinds of data transformations. Since we are looking to join our two streams, select the 'Join' option in the list.
Similar to adding a data source, we will now have to configure our join. First, give it a name. Below it, you also have the option to add a description, but it's not mandatory. Then you will have to select the streams you want to join. We have selected 'OrdersStream' as our Left stream and 'CustomersStream' as our Right stream.
You will then have to choose a Join type:
• Full outer - Returns all rows from both tables, matched where possible. If there's no match, it fills in NULL for the missing side.
• Inner - Returns only the rows that have matching values in both tables.
• Left outer - Returns all rows from the left table and the matched rows from the right table. If there's no match, the result is NULL for right table columns.
• Right outer - The opposite of a left join. It returns all rows from the right table and the matched rows from the left table.
• Custom (cross) - Returns the Cartesian product of both tables (i.e., each row from the first table combined with each row from the second table).
In our case, since we aren't sure if all the customers in our customers.csv will be present in our orders.csv (it might just be the orders of a certain period), we will choose a left outer join as the join type.
Since we want to join the orders in our orders.csv data set with the customers in our customers.csv data set who have placed these orders, we will need to match the right customers to the right orders. This is done through the join condition: the column both data sets have in common which we talked about earlier. As mentioned before, in our case this is the 'customerid' column. Select 'customerid' as the Left: OrdersStream's column. Then select '==' as the equality operator, which means that the values in the columns for both the left and right stream have to match. Next, also select 'customerid' as the Right: CustomersStream's column.
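If you find it easier to reason about joins in code than on the canvas, the following pandas sketch mirrors the join we just configured: a left outer join of the orders stream against the customers stream on the shared customerid column. The tiny DataFrames are made up; this is not what Data Factory executes internally, just the same logic expressed in Python.

```python
import pandas as pd

# Small stand-ins for orders.csv and customers.csv; the real files have more columns.
orders = pd.DataFrame({"orderid": [1, 2, 3], "customerid": [10, 10, 99], "amount": [20.0, 35.5, 12.0]})
customers = pd.DataFrame({"customerid": [10, 11], "country": ["Germany", "France"]})

# Left outer join on customerid: every order is kept, customer info is added where it matches.
joined = orders.merge(customers, on="customerid", how="left")
print(joined)
```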
You will also see the join visualized on the canvas above, with a total of 23 columns. (13 columns from the OrdersStream and 10 columns from the CustomersStream.)
On the Optimize tab, we see an option we didn't see on the same tab when we were selecting data sources: Broadcast. To know what it does, you will have to know a bit more about Azure. Azure Data Factory's data flows run on Apache Spark, which is an open-source distributed computing system designed for fast processing of large-scale data. It is widely used for big data analytics because it can process massive datasets relatively quickly. It does so by breaking the data into smaller pieces and processing them all at once using multiple computers, which together form a cluster, with each computer in it being a node. Nodes can be either a master node or a worker node. Master nodes are like managers: they instruct the worker nodes what to do, while the worker nodes do the actual work.
Let's say you want to join a small data set with a large data set. When you create a join, your data will be shuffled around to create said join, which takes time. To prevent this shuffling of data, if the small data set is small enough to fit in each worker node's memory, you can send (broadcast) a copy of that data set to each worker node in the cluster. Since each worker node now has a copy of the small data set readily available, it can quickly match it with its part of the big data set without having to shuffle all the data around.
If you leave the setting on 'auto', you will let Azure Data Factory decide if your data sets fit the requirements for broadcasting. Fixed broadcasting, on the other hand, will have you decide if the smaller data set is small enough to be broadcasted. If it turns out to be too large, you might run into an out of memory error. You also have the option to turn off broadcasting.
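To make the broadcast idea a bit more tangible, this is roughly what a broadcast join looks like in PySpark (the DataFrames are made up). In Data Factory you only flip the Broadcast setting; you never write this yourself.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("broadcast-demo").getOrCreate()

orders = spark.createDataFrame([(1, 10), (2, 10), (3, 99)], ["orderid", "customerid"])
customers = spark.createDataFrame([(10, "Germany"), (11, "France")], ["customerid", "country"])

# Broadcasting the small customers table sends a copy to every worker node,
# so the join can run without shuffling the (potentially huge) orders data across the cluster.
joined = orders.join(broadcast(customers), on="customerid", how="left")
joined.show()
```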
Below the Broadcast option, you also have a Partition option. Since we have talked about partitioning before, we won't elaborate on it further. For our beginner's guide, we will leave these settings in their default state.
Similar to the Inspect tab we have seen during the creation of our data source streams, the Inspect tab shows us a read-only view of the metadata we are transforming. Here we see which columns are coming from which data stream, as well as the total number of columns coming from each.
The Data preview tab shows us that the data has been joined correctly. If the data preview is not working for you, make sure to enable the Data flow debug mode like we have shown you before, and click 'Refresh' in the top left corner of the Data preview tab.
Adding a filter
We will now add a filter. It's always best to filter your data as early as possible. In our case that would have been pre-join, so less data is processed during the join operation. However, for demonstrative purposes it's good to show how to apply a filter at this point. Click the + sign next to the join we have just made.
Select 'Filter' from the list.
On the Filter settings tab, make up a name for your filter and select an incoming stream (which is our join). Then, click on the Filter on field. Below the field, a link to the expression builder will appear. Click on the link.
In the expression builder, you can use expressions and functions to create a filter. In our case, we want to create a filter which keeps only the rows where the country column contains the value 'Germany'. We can use the equals function, which checks whether both values in it are equal. (In our case, it checks if the values in the country column equal the value 'Germany'.) When you're finished building your expression, click the 'Save and finish' button.
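Our expression ends up being something like equals(country, 'Germany'). In plain pandas terms, the same filter is a one-liner; the sketch below (on a made-up DataFrame) only illustrates the logic the expression expresses.

```python
import pandas as pd

joined = pd.DataFrame({"orderid": [1, 2, 3], "country": ["Germany", "France", "Germany"]})

# Keep only the rows where the country column equals 'Germany',
# mirroring the equals(country, 'Germany') filter in the data flow.
germany_only = joined[joined["country"] == "Germany"]
print(germany_only)
```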
You will see the expression appear in the Filter on field.
When we look at the country column, we see only 'Germany' values in it. This means that our expression worked as intended.
Adding a Sink
Next, we will be adding a sink transformation, also known as a sink. When you have transformed your data, you will need to write it to a destination; your transformed data needs to be stored somewhere. This is done through a sink. Every data flow requires at least one sink, but you can add as many sinks as you deem necessary. Each sink is associated with exactly one dataset object or linked service, so if you want your data to be stored in both your Azure SQL Database and your Azure Storage Account, you will need two sinks: one for each destination you want to write your data to.
To add a sink, click the + sign next to your filter.
Then select 'Sink' in the list.
In the sink configuration, make up a name for your sink and select the incoming stream (in our case our filter). Then, select a destination you want to write your transformed data to. In our case, we will be writing the data into our Azure SQL Database. We will also have to specify the Sink type.
• Dataset - This sink type will save instructions on where your data is stored, how to connect to it, and what the structure of that data is (like table columns or file format). This is great for when you want to reuse it across activities and pipelines.
• Inline - The Inline data sink type will define the details on your data within the sink. This is a good option for simple, one-time use situations. It's best to use it on small data sets only, as inline data is included directly in the pipeline rather than in a separate storage.
• Cache - A Cache sink type will temporarily store the data in memory for quick access during processing, which helps performance, because it won't need to fetch data from a slow source repeatedly. This is great for data sets which are referenced multiple times in the same data flow.
Since our dataset is very small, an inline sink will suffice. Naturally, we next need to select the final destination for our transformed data, which is our database in our Azure SQL Database. If you're following along, make sure you are connected to your Azure SQL Database! This can be done via an application such as Azure Data Studio. (Scroll down to the Azure Data Studio heading and follow the connection instructions.) Click the 'Test connection' button and test if the connection has been established.
• Schema name - Your table will be written into this schema. Click the 'Refresh' button to make sure all your schema names are loaded. Select the correct schema name. If you're not sure, re-read the part on schemas in this beginner's guide.
• Table name - This is the name your table will get in your Azure SQL Database. Make up a new name for it. We will choose the name 'orders_customers'.
• Table action - This is what will happen to the target table prior to writing to it. This is, of course, especially important if you're running the data flow regularly.
◦ None - No action will be done to the table.
◦ Recreate table - The table will get dropped and recreated. Required if creating a new table dynamically.
◦ Truncate table - This will keep the existing table structure but first truncate all rows, and then insert the new rows.
• Update method - You will have to specify what operations are allowed on your destination table in your Azure SQL Database. The default is to only allow inserts.
• Use TempDB - When we are writing data to our Azure SQL Database, the data doesn't usually go directly into the destination table, which is especially true for large data sets. It will first be loaded into a temporary storage. By enabling Use TempDB (which is the default setting), Azure will use a global temporary table in the TempDB database in SQL Server, which is used for temporary storage purposes. Once fully loaded into this temporary storage, the data will then be moved into our Azure SQL Database. However, if we disable the use of TempDB, this temporary storage is created directly in our Azure SQL Database. That might sound a bit superfluous, but such a staging table acts as a buffer to make the loading process smoother, safer, and more efficient.
• Pre SQL scripts & Post SQL scripts - You can enter multi-line SQL scripts here which will execute before (pre-processing) and after (post-processing) the data is written to our Azure SQL Database. You can use this to clean up the pre- or post-loaded data.
The Errors tab gives us options to manage what happens if the write to our Azure SQL Database, for whatever reason, runs into trouble. For our beginner's guide, we will leave this tab untouched. For informational purposes, we will give a short description of the available options.
• Linked service - Here you would have to select where you want to write your transformed data to. In our case, this would be Azure SQL Database. When you have selected your linked service, you will be able to test the connection.
• SQL error rows - In this section you will be able to choose what happens when writing to your database runs into an error due to constraints set by the destination (Azure SQL Database).
◦ Error row handling - By default the data flow will fail on the first error it encounters. Alternatively, you can choose 'Continue on error', which will let your data flow continue, despite individual rows containing errors.
◦ Transaction Commit - This option will let you choose whether your data gets written in batches or in a single transaction. The latter will have slower performance, and no data will be visible until the entire transaction is completed.
◦ Output success on error - If you want to have your data flow be marked as a success, despite having run into errors, enable this option.
◦ Output to separate file - This will output the rows which ran into an error into a separate CSV file in your Azure Blob Storage or Azure Data Lake Storage Gen2 account (for which you will have to specify a destination). The rows containing an error will have three columns added: the SQL operation (e.g., INSERT or UPDATE), the data flow error code, and the error message on the row.
• Assert failure rows - If you have assertions set up, here you can manage what happens when you run into errors with Assert failure rows. In Azure Data Factory, an assertion is a rule that you can apply to your data to validate certain conditions. It's like a quality check. For example, you might assert that a certain column must not be null. When this column does contain a null, you will run into an error. The option Output to sink will have these rows still be included in our Azure SQL Database. Output to separate file will output the rows containing errors to a separate file.
For this beginner's guide, we will leave these settings in their default state. If you want to set up assertions, click on the + sign next to a data stream, select 'Assert' and configure it to your liking.
We will skip the 'Assert' and move on to the Mapping tab. Here you can decide which incoming columns will get written to your destination, which in our case is our Azure SQL Database. By default, Auto mapping is enabled, which means that all incoming columns will be mapped.
You will also have options to skip duplicate input and output columns. Since we have made a join on a column which both source data sets have in common (the customerid column), it also means we will have a duplicate output column. This is visible in our mapping, where we see the input columns OrdersStream@customerid and CustomersStream@customerid, which means that both the OrdersStream and CustomersStream have provided a customerid column. To make sure we don't have a duplicate column in our Azure SQL Database, we can enable these options.
However, you can also import a schema or manually adjust the mapping. For demonstrative purposes, we have manually adjusted the output column for the CustomersStream@customerid to be 'customerid_orders'. This means that the customerid column provided by our CustomersStream will be renamed to 'customerid_orders' in our Azure SQL Database.
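In code terms, the adjustment we just made on the Mapping tab boils down to renaming (or dropping) one of the two customerid columns before the result is written out. The pandas sketch below uses made-up column names that mimic the stream-qualified names shown in the mapping; it is an illustration only.

```python
import pandas as pd

# Hypothetical joined result in which both source streams contributed a customerid column.
joined = pd.DataFrame({
    "orderid": [1, 2],
    "OrdersStream@customerid": [10, 10],
    "CustomersStream@customerid": [10, 10],
})

# Renaming one duplicate mirrors our manual mapping adjustment...
renamed = joined.rename(columns={"CustomersStream@customerid": "customerid_orders"})

# ...while dropping it mirrors the 'skip duplicate' options on the Mapping tab.
deduplicated = joined.drop(columns=["CustomersStream@customerid"])
print(list(renamed.columns), list(deduplicated.columns))
```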
Like before, the Optimize tab will let you choose whether or not you want to repartition your data after transforming it. Since we have walked through the available options before, we will skip it this time.
This time, the read-only view of the metadata on the Inspect tab reflects the change we have just made to the customerid column coming from the CustomersStream. Next, check the Data preview tab to see if all your data is there.
Validate and publish
Now that we have finished our data flow, we will have to validate and publish the changes we have made.
By clicking 'Validate all', we can check whether Azure runs into any errors with what we have just set up. This is an important step before we try to publish everything.
If all went well, you should see the message below.
Next, we will have to 'publish' our data flow. This means that all the changes we have made so far will be committed and saved. Everything we have set up will go live.
When completed, you should see this message.
Create a pipeline
Like we have mentioned before, a data flow cannot run by itself. It needs a pipeline to run, which is what we are going to set up next. Pipelines can be very elaborate, but in our case, we only need a simple one to run our data flow. We can create one by clicking on the + button under the Factory Resources header, and then go to Pipeline > Pipeline.
Next, simply drag the data flow we have just created to the canvas.
Next, give your pipeline a name. We will call ours 'OrdersTransformationPipeline'.
An important thing to keep in mind when working with pipelines is the different levels. When you have an activity selected, like the data flow we have just added, the options we see below the canvas are options on the activity level. The settings you adjust there are applied to that specific activity, not to the entire pipeline. When we don't have an activity selected, the options we see below the canvas are on a pipeline level. The settings we adjust there are applied to the whole pipeline.
Since we still have our data flow selected, we will shortly walk through the configuration options on the activity (data flow) level.
• Activity state - This option lets you deactivate ('comment out') the activity, without having to delete it.
• Timeout - The timeout is the maximum duration the activity is allowed to run. Once this duration has been reached without the activity finishing, it will throw a timeout error.
• Retry - Here you can set the number of times a failed activity can restart itself without manual intervention (someone having to trigger the pipeline). E.g., if the activity fails due to connectivity issues, it can retry by itself.
• Retry interval (sec) - If a retry is needed, you will need to specify the retry interval, which is the number of seconds it waits before retrying after a failure. (A small sketch of this retry behaviour follows after this list.)
• Secure input + Secure output - Depending on the data you're working with, you might benefit from securing the in- and output. This means that the activity's input and output values aren't logged in the monitoring section of Azure Data Factory.
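To make the Retry and Retry interval settings concrete (as referenced in the list above), here is a minimal Python sketch of the behaviour they configure: run the activity, and on failure wait the interval and try again until the retry count is exhausted. Data Factory handles all of this for you; the function below is purely an illustration of the concept.

```python
import time

def run_with_retries(activity, retries: int = 2, retry_interval_sec: int = 30) -> None:
    """Run an activity, retrying on failure (the behaviour Retry and Retry interval configure)."""
    for attempt in range(retries + 1):
        try:
            activity()
            return  # success, no further attempts needed
        except Exception:
            if attempt == retries:
                raise  # retries exhausted: surface the failure, like a failed activity run
            time.sleep(retry_interval_sec)  # wait before the next attempt
```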
For our beginner's guide, we will leave all but Name in their default settings.
The Settings tab contains the following options:
• Data flow - If it hasn't already been automatically selected for you, select the correct data flow.
• Run on (Azure IR) - Then, select the correct integration runtime (IR), which will be 'AutoResolveIntegrationRuntime' by default.
• Compute size - Select the right Compute size for your IR. Data flows run on Spark clusters. The configuration for these clusters is defined in the IR of the activity. Data flows distribute the data processing over different cores in a Spark cluster to perform operations in parallel. The bigger the Spark cluster, the more cores it contains; the more cores it has, the more processing power is available. The default cluster size is 'Small', which translates into four driver cores and four worker cores. 'Medium' clusters have eight driver and eight worker cores, and 'Large' clusters have sixteen of each. When you need to process lots of data, you might want to increase the cluster size. For our beginner's guide we are using a very small amount of data, for which a small compute size will suffice.
• Logging level - Next, we will have to choose a Logging level, which determines how much information is recorded about what happens when your data flow runs.
◦ Verbose - Verbose is the most detailed level of logging, and will fully log every single step and partition in your data flow. This is a great option for troubleshooting. However, it might generate a lot of data, which might cost more to store. Verbose is the default logging level.
◦ Basic - This level of logging will only record the duration of the data transformations (how long each transformation takes). This might be enough to monitor performance.
◦ None - Only a summary of the durations will be given.
For our beginner's guide, we will only select our data flow and IR, and leave the rest of the settings in their default state.
The Parameters tab will let you pass dynamic values into your data flow when you run it. This makes pipelines and data flows incredibly flexible and reusable. For example, remember when we added a filter in our data flow which filtered out everything but the rows where the country column contained the value 'Germany'? If, later on, we wanted to change this to another country, we would need to edit the data flow, because the value 'Germany' is hardcoded into our filter. Parameters are a way to circumvent that. They allow us to dynamically enter 'Germany', but also change it to another country when needed.
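A parameter behaves much like a function argument: the data flow references the parameter instead of a hardcoded value, and each run supplies an actual value. The Python sketch below illustrates that idea; in the data flow itself you would create a parameter and reference it in the filter expression (for example something like equals(country, $country), though you should verify the exact syntax against your own data flow).

```python
import pandas as pd

def filter_orders(joined: pd.DataFrame, country: str = "Germany") -> pd.DataFrame:
    """Parameterised filter: the country is supplied at run time instead of being hardcoded."""
    return joined[joined["country"] == country]

joined = pd.DataFrame({"orderid": [1, 2], "country": ["Germany", "France"]})
print(filter_orders(joined, country="France"))  # same 'data flow', different parameter value
```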
The User properties tab lets you add dynamic information (key-value pairs, up to 5 per activity) about the activity which might help you when you monitor its performance.
Publish and trigger your pipeline
Now it's time to validate and publish our pipeline. Like we have done before, first validate, then click 'Publish all' to publish all the changes we have made.
If all went well, you should see this message.
Pipelines need to be triggered. You can do this by clicking 'Trigger now', which will run the pipeline right away.
Since this is our first pipeline run, no records of a previous pipeline run can be found. Click 'OK' again.
When your pipeline is active, you should see this message.
Now it's time to check our Azure SQL Database, to check if it has successfully received our transformed data. If you haven't already, start up Azure Data Studio (the one not in your browser) and connect to it like we have done several times before. Click on the refresh button next to Tables.
We can now see the table our data flow has created.
To make sure our new table contains the right data, we can quickly run a query to check.
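If you prefer to script this check rather than clicking through Azure Data Studio, a quick query from Python works just as well. The sketch below assumes the pyodbc package, an installed ODBC Driver for SQL Server, the dbo schema, and placeholder connection details; substitute your own server, database, and credentials.

```python
import pyodbc

# Placeholder connection details; replace them with your own server, database, and credentials.
connection = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=<your-server>.database.windows.net;"
    "DATABASE=<your-database>;"
    "UID=<your-username>;PWD=<your-password>"
)

# Count the rows per country in the table our sink created; only 'Germany' should appear.
cursor = connection.cursor()
cursor.execute("SELECT country, COUNT(*) AS row_count FROM dbo.orders_customers GROUP BY country;")
for country, row_count in cursor.fetchall():
    print(country, row_count)
connection.close()
```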
As we can see, our table contains the right data. Only rows where the country column contains the value 'Germany' have been added.
This concludes our Azure for Beginner's guide to creating a data flow and pipeline in Azure Data Factory. Well done!