Airport Traffic by Domestic and International Carriers


Business Case

As a research and analytics team at an international carrier, we need to create data pipelines to produce reports using publicly available U.S. Department of Transportation (USDOT) airport traffic data. Airline executives have been consuming this information indirectly through USDOT reports and additional assets produced by business analysts. Our mandate is to create a data pipeline that drives a reporting system on international flights. The reporting system, including changes to the data pipeline, will be maintained by business analysts in various lines of business with help from the analytics team. Analysts often build pipelines like this one, which distill larger datasets into smaller, summarized ones. The goal of our team is to help them do so faster, more efficiently, and in a reproducible manner.

Supporting Data

The data comes from the U.S. International Air Passenger and Freight Statistics Report. As part of the T-100 program, USDOT receives traffic reports of U.S. and other international airlines operating to and from United States airports. Data engineers on the team ingest this publicly available data and provide us with the following datasets:

  • Departures: Data on all flights between U.S. gateways and non-U.S. gateways, irrespective of origin and destination. Each observation provides information on a specific airline for a pair of airports, one in the U.S. and the other outside. There are three main columns recording the number of flights: Scheduled, Charter, and Total.
  • Passengers: Data on the total number of passengers for each month and year between a pair of airports, as serviced by a particular airline. The number is also broken down into passengers on scheduled flights and passengers on chartered flights.

We will start with data for year 2017.

Workflow Overview

The final pipeline in Dataiku DSS will look like this:


The Flow has the following high-level steps:

  1. Download the datasets to form the input data
  2. Clean the Passengers dataset, group by airport id and find the busiest airports by number of passengers that year
  3. Clean the Departures dataset and turn the monthly information on airport pairs into market share data

Technical Requirements

To complete this walkthrough, the following requirements will need to be met:

  • Have access to a Dataiku DSS instance – that’s it!

Detailed Walkthrough

Create a new Dataiku DSS project and name it International Flight Reporting.

Finding the Busiest Airports by Volume of International Passengers

In the Flow, select + Recipe > Visual > Download. Name the output folder Passengers and create the recipe. Add a new source and specify the following URL:

Run the recipe to download the files.

Create a New Files in Folder dataset from the passengers folder. With the folder selected, we can get to this option via Actions > Create dataset (found in the upper-right corner). Click Test to let Dataiku detect the format and parse the data accordingly. Create a dataset named passengers.

With the new dataset as the input, create a new Sample/Filter recipe. Turn filtering On and create a filter that keeps rows where Year equals 2017. Under sampling, choose No sampling (whole data).

With the new dataset as the input, create a new Prepare visual recipe with the following steps in its script:

  1. Parse the date_dte column into a proper date column. Dataiku should detect the correct date format as yyyy-MM-dd HH:mm:ss. If it does not, go ahead and select the yyyy-MM-dd HH:mm:ss date format manually in the Smart date editor. Name the output column date_parsed.
  2. Identify the months using month names. One way to do so is to use the Find & Replace processor to replace the numerical values in the Month column, writing the results to a new column called month_name. For example, replace the value “1” with “Jan”.
  3. Concatenate the three columns Year, Month, and month_name using - as the delimiter. Name the output column year_month.
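For readers who think in code, the three Prepare steps above can be sketched as an equivalent pandas transformation. This is a minimal illustration using invented sample rows; only the column names come from this walkthrough.

```python
import pandas as pd

# Hypothetical sample rows shaped like the passengers data.
df = pd.DataFrame({
    "date_dte": ["2017-01-01 00:00:00", "2017-12-01 00:00:00"],
    "Year": [2017, 2017],
    "Month": [1, 12],
})

# Step 1: parse date_dte into a proper date column.
df["date_parsed"] = pd.to_datetime(df["date_dte"], format="%Y-%m-%d %H:%M:%S")

# Step 2: replace month numbers with month names in a new column
# (the Find & Replace processor with Complete value matching).
month_names = {1: "Jan", 2: "Feb", 3: "Mar", 4: "Apr", 5: "May", 6: "Jun",
               7: "Jul", 8: "Aug", 9: "Sep", 10: "Oct", 11: "Nov", 12: "Dec"}
df["month_name"] = df["Month"].map(month_names)

# Step 3: concatenate Year, Month, and month_name with "-" as the delimiter.
df["year_month"] = (df["Year"].astype(str) + "-" +
                    df["Month"].astype(str) + "-" + df["month_name"])
```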


Among the drop-down options for Matching Mode, select Complete value, since Substring will result in some errors. For example, substring replacement will replace each digit of 12 (1 and 2) individually, yielding JanFeb instead of Dec.

Next, we are going to aggregate the information by airport to create a list of the 20 busiest airports for international travellers. We’ll use the Group visual recipe:

  1. Group by usg_apt.
  2. Name the output dataset passengers_by_airport.
  3. In the Group step, deselect Compute count for each group and then select the following aggregations: fg_apt (Distinct), Scheduled (Sum), Charter (Sum), Total (Sum).
  4. Rename the column names in the Output step according to the table below, then Run the recipe.
Original name    New name
-------------    -------------
usg_apt          IATA_code
fg_apt_distinct  airport_pairs
Scheduled_sum    Scheduled
Charter_sum      Charter
Total_sum        Total
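The Group recipe and the column renaming above correspond roughly to a pandas groupby with named aggregations. This sketch uses hypothetical toy values; only the column names are taken from this walkthrough.

```python
import pandas as pd

# Toy rows shaped like the filtered passengers data (values invented).
passengers = pd.DataFrame({
    "usg_apt": ["JFK", "JFK", "MIA"],
    "fg_apt": ["LHR", "CDG", "LHR"],
    "Scheduled": [100, 200, 50],
    "Charter": [5, 0, 10],
    "Total": [105, 200, 60],
})

# Group by usg_apt: distinct count of foreign airports, sums of flight columns,
# then rename to the output names from the table above.
by_airport = (passengers.groupby("usg_apt", as_index=False)
              .agg(airport_pairs=("fg_apt", "nunique"),
                   Scheduled=("Scheduled", "sum"),
                   Charter=("Charter", "sum"),
                   Total=("Total", "sum"))
              .rename(columns={"usg_apt": "IATA_code"}))
```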

Finally, to narrow down the top 20 airports by volume of international passengers, let’s use the TopN recipe.

  1. Name the output dataset passengers_by_airport_top20.
  2. In the Top N step, retrieve the top 20 rows sorted by the Total column in descending order.
  3. Run the recipe.
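In pandas terms, the TopN recipe is a single call. A minimal sketch with invented totals (here keeping the top 2 for brevity):

```python
import pandas as pd

# Hypothetical aggregated totals per airport.
by_airport = pd.DataFrame({"IATA_code": ["JFK", "MIA", "LAX", "ORD"],
                           "Total": [305, 60, 400, 120]})

# Keep the top rows sorted by Total in descending order.
top = by_airport.nlargest(2, "Total")
```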

This recipe produces a list of the busiest airports by volume of international passengers. We can now export the dataset as a CSV, or share it with other projects in the instance. In a few easy steps, we’ve replicated the table on this Wikipedia page, even down to the total number of passengers.


Calculating the Market Share of Carrier Groups

Next, we’ll create a data pipeline for the flight-totals information in the dataset on international departures to and from U.S. airports. Again, let’s use a Download recipe. Name the output folder departures, then add the following URL as the data source:

Create a New Files in Folder dataset from the departures folder.

Copying Existing Recipes to Prepare Departures Data

As with the passenger data, we want to look at the 2017 departures data.

  1. Select the Sample/Filter recipe in the Flow and choose Actions > Copy.
  2. Select the departures dataset as the input.
  3. Name the output dataset departures_filtered and click Create Recipe.

Now look through the columns of the departures_filtered dataset. They look quite similar to the initial passengers dataset, so we can re-use the data preparation steps from the earlier pipeline. We can do so by copying the entire recipe, as we did with the Sample/Filter. An alternative is to copy and paste the steps from the first Prepare recipe into a new one for this pipeline.

  1. Navigate to the existing Prepare recipe, and select all the steps in the Script.
  2. Within the Script, select Actions > Copy 3 steps.
  3. Create a new Prepare recipe using the departures_filtered dataset as the input and name the output something like departures_prepared.
  4. In this new recipe, paste the copied steps.


Here’s a helpful GIF that shows how to copy and paste steps from one Prepare recipe to another.

Pivot to Aggregate Carrier Group Totals into Columns

Each row in the departures_prepared dataset represents travel between a pair of airports during a month. In order to compare domestic vs. international airlines, we want to aggregate this dataset by the carriergroup column (0=domestic airline) for each month of the year. The aggregated values we want to compute are the number of Scheduled, Charter, and Total flights. With the departures_prepared dataset selected:

  • Choose Actions > Pivot.
  • Pivot By the carriergroup column.
  • Rename the output dataset to something like departures_by_carriergroup.
  • Click Create Recipe.
  • Make Year and Month the row identifiers.
  • Under Populate content with, deselect Count of records.
  • Instead, select the columns Scheduled, Charter and Total from the dropdown menu and choose sum as the aggregation for all of them.
  • To keep a more descriptive row identifier, also keep month_name as a row identifier.
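The Pivot recipe above behaves much like a pandas pivot table. The sketch below uses invented numbers; the column names (including DSS-style output names such as 0_Total_sum) follow this walkthrough.

```python
import pandas as pd

# Hypothetical departures rows: carriergroup 0 = U.S. carriers, 1 = foreign.
dep = pd.DataFrame({
    "Year": [2017, 2017, 2017, 2017],
    "Month": [1, 1, 2, 2],
    "carriergroup": [0, 1, 0, 1],
    "Scheduled": [10, 20, 30, 40],
    "Charter": [1, 2, 3, 4],
    "Total": [11, 22, 33, 44],
})

# Pivot by carriergroup with Year/Month as row identifiers,
# summing the three flight columns.
pivoted = dep.pivot_table(index=["Year", "Month"], columns="carriergroup",
                          values=["Scheduled", "Charter", "Total"],
                          aggfunc="sum")

# Flatten the column MultiIndex into names like 0_Total_sum, 1_Scheduled_sum.
pivoted.columns = [f"{grp}_{col}_sum" for col, grp in pivoted.columns]
pivoted = pivoted.reset_index()
```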

Next, we will add a Prepare recipe (name the output dataset departures_by_month) to clean up the pivoted data and create a few new columns. We will group the steps together so we can copy and paste them. In brief:

  1. Let’s first work on the columns relating to scheduled flights.
  2. We’ll start by creating a total of scheduled flights using the Formula processor with the expression 0_Scheduled_sum + 1_Scheduled_sum to create a new column Scheduled_total.
  3. Next, create two more columns, Scheduled_US_mktshare and Scheduled_IN_mktshare, for the market shares of U.S.-based and international carriers. The formula should look like the following: [X]/Scheduled_total * 100 where X is either 0_Scheduled_sum or 1_Scheduled_sum.
  4. To organize these steps, click +Add A Group and name it Scheduled. Select the steps we just created and add them to the group using the Actions dropdown and then +Add to Group > Scheduled.
  5. Copy the group of steps we just created into two new groups called Charter and Total, and change the column names in order to compute these columns for the chartered flights as well as the overall total.
  6. Now, we can remove some of the intermediate columns. Instead of selecting individual columns to keep or remove, we’ll use the Delete/Keep processor along with a regular expression. Add a step using the Delete/Keep processor and choose to Remove columns. To remove all columns starting with 0 or 1 followed by word characters (a-zA-Z0-9_) of indeterminate length, use the following regular expression: ^[0-1]_\w*
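The formula and column-removal steps above translate naturally to pandas. This is a minimal sketch of the Scheduled group only, with invented values; the Charter and Total groups follow the same pattern.

```python
import pandas as pd

# Hypothetical pivoted totals (0_ = U.S. carriers, 1_ = foreign carriers).
df = pd.DataFrame({"Year": [2017], "Month": [1],
                   "0_Scheduled_sum": [75], "1_Scheduled_sum": [25]})

# Scheduled group: overall total, then market shares as percentages.
df["Scheduled_total"] = df["0_Scheduled_sum"] + df["1_Scheduled_sum"]
df["Scheduled_US_mktshare"] = df["0_Scheduled_sum"] / df["Scheduled_total"] * 100
df["Scheduled_IN_mktshare"] = df["1_Scheduled_sum"] / df["Scheduled_total"] * 100

# Remove every column matching ^[0-1]_\w* (the intermediate pivot columns).
df = df.loc[:, ~df.columns.str.match(r"^[0-1]_\w*")]
```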


Regular expressions (regex) are used to define a search pattern using a sequence of characters. They are quite powerful and extensible and can be used in Dataiku DSS in many places. You can find a good introduction to regex in the Python for Informatics course slides, and you can also test out regex patterns using online testers.
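As a quick illustration, here is how the column-removal pattern behaves against a few column names, using Python’s standard re module (the column names are examples from this walkthrough):

```python
import re

# The pattern used in the Delete/Keep step: a leading 0 or 1, an underscore,
# then any run of word characters.
pattern = re.compile(r"^[0-1]_\w*")

cols = ["0_Scheduled_sum", "1_Total_sum", "Scheduled_total", "Year"]
matched = [c for c in cols if pattern.match(c)]
# matched == ["0_Scheduled_sum", "1_Total_sum"]
```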

Great, we’ve created two summaries of larger datasets and shrunk them down into datasets with only a few dozen rows. In the first data pipeline we got down to the top 20 busiest airports. Then we also calculated the monthly totals of flights and the market share of two categories of carriers for the year 2017.

Adding a Lagged Window to Calculate Year-to-Year Change

To compare the total number of departures with the year before, we can re-use the same data pipeline. First we need to widen our Filter to include data from the year 2016.

By going back to the Filter recipe, we can add another condition where Year equals 2016, and keep rows that satisfy at least one of the conditions.
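The widened filter keeps a row if either condition holds. In pandas terms (sample data invented):

```python
import pandas as pd

# Hypothetical rows spanning several years.
df = pd.DataFrame({"Year": [2015, 2016, 2017], "Total": [1, 2, 3]})

# Keep rows where Year equals 2016 OR Year equals 2017.
filtered = df[(df["Year"] == 2016) | (df["Year"] == 2017)]
```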

In the Flow, right-click on the Filter recipe and select Build flow from here to re-create all datasets in the pipeline.


Here is documentation about rebuilding datasets in Dataiku and the difference between building datasets upstream or downstream.

The departures_by_month dataset now has totals of departures for two years: 2016 and 2017. Now we can calculate how the traffic changed from year to year. With departures_by_month selected:

  1. Choose Actions > Window
  2. Click Create Recipe.

To get the total flights from a year ago to compare with the totals for a particular month, we need to get the same value lagged by 12 months. In the Window definitions step of the Window recipe:

  1. Turn on Order Columns and select Year and Month so the months are laid out in ascending order. This defines how the dataset is ordered when the lag is calculated.
  2. In the Aggregations step, Retrieve all of the columns, and for the Total column, add the lagged value going back 12 rows, i.e. 12 months or a whole year.
  3. Run the recipe.
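The lagged window amounts to ordering the rows and shifting the Total column back 12 rows. A pandas sketch with invented monthly totals:

```python
import pandas as pd

# Two years of hypothetical monthly totals.
months = list(range(1, 13))
df = pd.DataFrame({"Year": [2016] * 12 + [2017] * 12,
                   "Month": months + months,
                   "Total": list(range(100, 112)) + list(range(200, 212))})

# Order by Year and Month, then lag Total by 12 rows (one year).
df = df.sort_values(["Year", "Month"]).reset_index(drop=True)
df["Total_lag12"] = df["Total"].shift(12)
```

Each 2017 row now carries the total of the same month in 2016; the first 12 rows have no year-ago value, so Total_lag12 is empty there.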

The output dataset should show that all the months in 2017 now have a value for the lagged total number of flights in the column Total_lag12. For any month that year, it should match the value of the same month from a year ago.

To create the final presentation dataset, add a Prepare recipe with the following steps in the Script:

  1. Keep only rows from the year we need: 2017.
  2. Calculate a year_toyear_change column with the formula (Total - Total_lag12)/Total_lag12 * 100.
  3. Keep only the following columns: Year, Month, Total_US_mktshare, Total_IN_mktshare, Total, Total_lag12, year_toyear_change.
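The final steps can be sketched in pandas as well. For brevity this invented sample carries only the Total columns, not the market-share ones:

```python
import pandas as pd

# Hypothetical rows with this year's total and the 12-month lagged total.
df = pd.DataFrame({"Year": [2016, 2017], "Month": [1, 1],
                   "Total": [100.0, 125.0], "Total_lag12": [None, 100.0]})

# Keep only 2017, then compute the year-to-year percent change.
df = df[df["Year"] == 2017].copy()
df["year_toyear_change"] = (df["Total"] - df["Total_lag12"]) / df["Total_lag12"] * 100

# Keep only the presentation columns.
df = df[["Year", "Month", "Total", "Total_lag12", "year_toyear_change"]]
```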

Learn More

Great job! Building data pipelines is essential to creating data products. This is a first step in doing more with data. Data products can go beyond static insights like rankings or tables, and the process can be automated for production with scenarios.

Also see: