In this post, we move the entire stack onto Docker, a tool that enables us to package the solution and its dependencies into neat little things called container images. This will make the app portable, enable it to run consistently on different machines, and make it easier to run/stop the app.
The Series
This is the fifth post in my series, Towards Open Options Chains: A Data Pipeline for Collecting Options Data at Scale:
- Database Design
- Foundational ETL Code
- Getting Started with Airflow
- Building the DAG
- Containerising the Pipeline
Pre-requisites
In this post, we’ll be using Docker. I understand that this might be a big jump for some: from running services through the shell to deploying containers. It is a different paradigm, and requires you to learn a new toolkit. This was why I summarised the launch / teardown code in the penultimate section in Part IV: Building the DAG - it’s for readers who are happy with the pipeline we built so far and don’t feel the need to venture further. That’s perfectly fine.
But, if you’re ready to dive into Docker, make sure you have the pre-requisites below and read on.
- Operating system: Linux - I use Ubuntu 20.04 on Windows Subsystem for Linux
- Software: Docker and Docker Compose
About Containers
Let’s start with an example. Suppose that you’ve been collecting Lego for years now. You have accumulated a huge box of Lego bricks, and want to try your hand at creating your own design, say, a house. You’re pretty proficient, so you’re able to quickly assemble a house. With so many bricks collected over the years, you’re not in short supply, and use all manners of bricks - anything that makes the house look good.
You have a few options for the product. You could:
- Provide a rough manual on what goes where, and throw in whatever bricks you thought you used
- Provide a detailed manual, and throw in whatever bricks you thought you used
- Provide a detailed manual and the exact quantities of the specific bricks you used
If you’ve ever built Lego, you’d know that the first two options are out of the question. Option 1 sounds plain silly. We need to know where exactly each brick goes! Option 2 is a slight improvement, but it’s not guaranteed that we can reproduce the product. What if the customer builds a section of the house, only to find that there are insufficient bricks to create a ceiling? Option 3 is what we expect from a Lego product. We have everything we need to re-assemble the house. It doesn’t matter who is assembling it or where it is shipped from - it works!
Now take your understanding of option 3 - the neat package of everything you need to assemble a Lego design - and apply it to software applications. Option 3 is exactly what container images do: they bundle all the application code, dependencies, and configurations for your app into a single package. The instructions for assembling the image are also available, so you can re-build the container from scratch and still end up with the exact image that was distributed. When distributed as a container image and instantiated/run as a container, the app works consistently regardless of who runs it and on which machine, because it contains everything it needs to function.
This is what we plan to do with our data pipeline solution.
Containerised Version of Our Solution
If we break down our solution so far into its components, we see only two:
- A PostgreSQL database
- Airflow
The dynamic DAG script we developed does not count as a tool on its own. Although it is a key logical piece of our solution, it is just code that Airflow uses.
To containerise our solution, we will run Postgres and Airflow containers and link them up. Of course, there are intricacies that we’ll need to handle along the way. This post will provide a detailed walkthrough.
Setting Up
Folder Structure
First, we set up the folder structure. See the diagram below for a reference.
root
├── dags # Contains DAGs to be made accessible to Airflow
│ └── open_options_chains.py # Our dynamic DAG script
├── db # Contains scripts for the Postgres instance
│ └── init-postgres.sh # Shell script to set up Postgres
├── logs # Contains logs from the Airflow instance
├── pgdata # Contains data from the Postgres instance
├── scripts # Contains scripts to initialise Airflow and run its services
│ ├── init-entrypoint.sh # Shell script to initialise Airflow
│ ├── scheduler-entrypoint.sh # Shell script to launch scheduler
│ └── webserver-entrypoint.sh # Shell script to launch webserver
├── .env # File containing environment variables
└── docker-compose.yml # File specifying what services to run, along with the configs
There are quite a few items here. We’ll run through the config files first. The folders and their contents will be addressed as we explore the services defined for our app (let’s call it an app for simplicity).
Config: Docker Compose File
Docker Compose is a tool for defining and running multi-container Docker applications. In the file (below), we define several services (more on this later). All you need to know for now is that we need to explicitly state how each service should be configured, and which other services it must communicate with. I will reproduce the relevant sections of the configuration file as I describe the services.
version: '3.8'
services:
  postgres:
    image: postgres
    env_file:
      - .env
    volumes:
      - ./db:/docker-entrypoint-initdb.d/
      - ./pgdata:/var/lib/postgresql/data
    ports:
      - "5432:5432"
  airflow-init:
    image: apache/airflow
    entrypoint: ./scripts/init-entrypoint.sh
    env_file:
      - .env
    volumes:
      - ./scripts:/opt/airflow/scripts
    restart: on-failure:10
    depends_on:
      - postgres
  scheduler:
    image: apache/airflow
    entrypoint: ./scripts/scheduler-entrypoint.sh
    restart: on-failure:10
    depends_on:
      - postgres
    env_file:
      - .env
    volumes:
      - ./scripts:/opt/airflow/scripts
      - ./dags:/opt/airflow/dags
      - ./logs:/opt/airflow/logs
  webserver:
    image: apache/airflow
    entrypoint: ./scripts/webserver-entrypoint.sh
    restart: on-failure:10
    depends_on:
      - postgres
      - scheduler
    env_file:
      - .env
    volumes:
      - ./scripts:/opt/airflow/scripts
      - ./dags:/opt/airflow/dags
      - ./logs:/opt/airflow/logs
    ports:
      - "8081:8080"
Config: Environment File
This file (`.env`) defines variables that our containers can use. The contents should be as shown below. Do remember to input your TD Ameritrade (TDA) API key.
# Postgres
POSTGRES_USER=airflow
POSTGRES_PASSWORD=airflow
POSTGRES_DB=airflow
APP_DB_USER=openoptions
APP_DB_PASS=openoptions
APP_DB_NAME=optionsdata
# Airflow
APP_AIRFLOW_USERNAME=admin
APP_AIRFLOW_PASSWORD=password
AIRFLOW__CORE__SQL_ALCHEMY_CONN=postgresql+psycopg2://${POSTGRES_USER}:${POSTGRES_PASSWORD}@postgres/${POSTGRES_DB}
AIRFLOW__CORE__EXECUTOR=LocalExecutor
# TDA API
API_KEY=<your TDA API key here>
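To make the interpolation concrete, here is a small Python sketch (my own, not part of the pipeline) that mimics how Docker Compose expands the `${VAR}` references in the `.env` file to produce `AIRFLOW__CORE__SQL_ALCHEMY_CONN`:

```python
def build_sqlalchemy_conn(env: dict) -> str:
    """Mimic the ${VAR} interpolation Docker Compose applies to the .env file."""
    return "postgresql+psycopg2://{user}:{password}@postgres/{db}".format(
        user=env["POSTGRES_USER"],
        password=env["POSTGRES_PASSWORD"],
        db=env["POSTGRES_DB"],
    )

# The values from the .env file above
env = {"POSTGRES_USER": "airflow", "POSTGRES_PASSWORD": "airflow", "POSTGRES_DB": "airflow"}
print(build_sqlalchemy_conn(env))
# postgresql+psycopg2://airflow:airflow@postgres/airflow
```

Note that the host is `postgres`, not `localhost` - inside the Compose network, services reach each other by service name.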
Services
Now, we move on to the key element: services. The services that will need to run as part of the app are:
- `postgres`: An instance for the Postgres database
- `airflow-init`: A short-lived instance to initialise Airflow
- `scheduler`: An instance for the Airflow scheduler
- `webserver`: An instance for the Airflow webserver
Postgres Database
The lines below define our Postgres instance:
postgres:
  image: postgres
  env_file:
    - .env
  volumes:
    - ./db:/docker-entrypoint-initdb.d/
    - ./pgdata:/var/lib/postgresql/data
  ports:
    - "5432:5432"
- The `image` tag defines which pre-existing image on Docker Hub (see below) the container should run from. In this case, we're using the official Postgres Docker image.
- The `env_file` tag points to the environment file we created earlier. This allows us to define variables in one place; our containers can then use them.
- The `volumes` tag indicates which local directories are mounted into the container. Any data generated in the specified folder inside the container is persisted in the local directory. For our app, we mount two volumes:
  - `./db` on our local machine, which contains the initialisation scripts for the database, to `/docker-entrypoint-initdb.d/` inside the container. The scripts in the latter are run when the container is launched.
  - `./pgdata` on our local machine, to persist Postgres data, to `/var/lib/postgresql/data` inside the container, where Postgres stores its data.
- The `ports` tag exposes a port to our local machine. The first number is the port on our local machine (5432), and the second is the port inside the container on which Postgres listens (also 5432).
Note: Docker Hub is an open online repository for creating, managing, and sharing container images with others.
Airflow Initialisation
This is a service that initialises Airflow and shuts down. If Airflow was previously initialised, this service will not change the existing settings, and will shut down as well.
airflow-init:
image: apache/airflow
entrypoint: ./scripts/init-entrypoint.sh
env_file:
- .env
volumes:
- ./scripts:/opt/airflow/scripts
restart:
on-failure:10
depends_on:
- postgres
- The pre-existing `image` used is the official Airflow Docker image.
- The `entrypoint` tag specifies commands/executables that will always run when the container is launched. We will examine the script below.
- The `restart` tag specifies when to restart the container. We request that this service restart on failure, up to 10 times.
- The `depends_on` tag expresses the dependency between services, which affects the order in which they are started. The `airflow-init` service will start only after the Postgres instance has. Note that the Postgres service need not be *ready* for this service to start - this causes problems that our entrypoint script will resolve.
- The `profiles` tag (not used in our configuration) helps to group services together. It comes in handy when we want to run specific subsets of related services.
- The `env_file` and `volumes` tags do the same thing as they did for the Postgres service.
The entrypoint script (below) for this service does the following every time the `airflow-init` container is run:
- Run a while loop attempting to connect to Postgres
- Check if a connection can be established
- If a connection can't be established, print a message saying we'll wait, and wait 5 seconds before trying again
- If a connection is established, it means that Postgres is ready, and we initialise Airflow

#!/usr/bin/env bash

# Wait for Postgres service to be ready
until PGPASSWORD=$POSTGRES_PASSWORD psql -h "postgres" -U "$POSTGRES_USER" -c '\q'; do
  >&2 echo "PostgreSQL service unavailable. Retrying in 5 seconds..."
  sleep 5
done

>&2 echo "PostgreSQL service started successfully. Initialising Airflow..."

# Initialise database
airflow db init

# Create account
airflow users create -u "$APP_AIRFLOW_USERNAME" -p "$APP_AIRFLOW_PASSWORD" -f Firstname -l Lastname -r Admin -e admin@airflow.com

# Add connection (double quotes so the variables are expanded)
airflow connections add 'postgres_optionsdata' \
    --conn-type 'postgres' \
    --conn-login "$APP_DB_USER" \
    --conn-password "$APP_DB_PASS" \
    --conn-host 'postgres' \
    --conn-port '5432' \
    --conn-schema "$APP_DB_NAME"
Initialising Airflow entails the following:
- Initialise the Airflow database backend
- Create a new user based on the username and password combination specified in the `.env` file
- Add our Postgres connection - something that we previously performed via the Airflow UI
If this service is run after Airflow has already been initialised, it will not affect the existing settings. After the script has run, there are no further terminal commands. Hence, this service will shut down on its own.
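The wait-and-retry logic in the entrypoint script can be sketched as a generic Python helper (`wait_until_ready` and `fake_probe` are my own illustrative names, not part of the project):

```python
import time

def wait_until_ready(probe, retries=10, delay=5, sleep=time.sleep):
    """Call probe() until it returns True, waiting `delay` seconds between attempts."""
    for attempt in range(1, retries + 1):
        if probe():
            return attempt  # number of attempts it took
        sleep(delay)
    raise TimeoutError("service did not become ready")

# Simulate a Postgres that only accepts connections on the third attempt
calls = {"n": 0}
def fake_probe():
    calls["n"] += 1
    return calls["n"] >= 3

attempts = wait_until_ready(fake_probe, delay=0)
print(attempts)  # 3
```

The entrypoint script retries forever, whereas Compose's `restart: on-failure:10` caps the container restarts; this sketch caps the attempts inside the loop instead.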
Airflow Scheduler
The lines below define our Airflow Scheduler service:
scheduler:
image: apache/airflow
entrypoint: ./scripts/scheduler-entrypoint.sh
restart:
on-failure:10
depends_on:
- postgres
env_file:
- .env
volumes:
- ./dags:/opt/airflow/dags
- ./scripts:/opt/airflow/scripts
- ./logs:/opt/airflow/logs
- The pre-existing `image` used is the official Airflow Docker image - the same as for the Airflow initialisation service defined previously.
- This service has its own dedicated `entrypoint` script. We'll discuss this along with the entrypoint script for the Webserver below.
- The `restart`, `depends_on`, `env_file` and `volumes` tags do the same thing as they did for the other services.
- For the `volumes` tag, we mount the DAGs, scripts, and logs folders from our local machine into the Docker container. This allows us to persist our DAGs, write scripts, and monitor logs on our local disk.
Airflow Webserver
The Airflow Webserver service has pretty much the same configuration as the Scheduler service:
webserver:
image: apache/airflow
entrypoint: ./scripts/webserver-entrypoint.sh
restart:
on-failure:10
depends_on:
- postgres
- scheduler
env_file:
- .env
volumes:
- ./scripts:/opt/airflow/scripts
- ./dags:/opt/airflow/dags
- ./logs:/opt/airflow/logs
ports:
- "8081:8080"
The `ports` tag specifies the port on our local machine (8081) that we can use to access the webserver (8080 inside the container).
Entrypoint Script for Scheduler and Webserver
The Scheduler and Webserver services have entrypoint scripts similar to the Airflow initialisation service. The scripts do the following:
- Run a while loop attempting to connect to Postgres
- Check if a connection can be established
- If a connection can’t be established, return a message saying we’ll wait, and wait 5 seconds before trying again
- If a connection is established, it means that Postgres is ready, and we launch the Scheduler or Webserver respectively
#!/usr/bin/env bash
# Wait for Postgres service to be ready
until PGPASSWORD=$POSTGRES_PASSWORD psql -h "postgres" -U "$POSTGRES_USER" -c '\q'; do
  >&2 echo "PostgreSQL service unavailable. Retrying in 5 seconds..."
  sleep 5
done
>&2 echo "PostgreSQL service started successfully. Launching Airflow Scheduler..."
# Launch scheduler
airflow scheduler
Providing Read and Write Access
To enable data to be written to the mounted volumes, we need to enable read/write permissions on the DAGs and logs folders in our main directory:
chmod -R 777 dags
chmod -R 777 logs
Operating the App
Setup
Before we can launch the app proper, we need to initialise Airflow. To do so, run the following in a terminal:
cd <root folder for app>
docker-compose up airflow-init
There will be some activity in the terminal as Docker Compose launches Postgres, and then the Airflow initialisation service. Once initialisation is complete, the Airflow initialisation service will shut down. However, the Postgres service will continue running. You can stop it with `docker-compose stop` if you want.
Launching the App
We can now launch the app for the first time with the following command:
docker-compose up -d
The `-d` flag runs the containers in detached mode, in the background, which allows you to close your terminal. In a browser, proceed to http://localhost:8081 to access the UI for your Airflow instance.
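The webserver can take a little while to come up after launch, so here is an optional Python helper (my own, not part of the project) that polls a URL until it answers:

```python
import time
import urllib.error
import urllib.request

def wait_for_http(url: str, timeout: float = 60.0, interval: float = 2.0) -> bool:
    """Poll `url` until it answers any HTTP response, or give up after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            urllib.request.urlopen(url, timeout=interval)
            return True
        except urllib.error.HTTPError:
            return True  # got an HTTP response (e.g. a redirect/login page) - server is up
        except (urllib.error.URLError, OSError):
            time.sleep(interval)  # not accepting connections yet
    return False

# wait_for_http("http://localhost:8081")  # True once the Airflow UI is reachable
```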
Shutting Down the App
To shut down the app, ensure that the UI window is closed and use the following command:
docker-compose stop
This will stop the containers, but will not delete them. Airflow will remain initialised, and when you launch the app again with `docker-compose up`, all your data and settings will be intact.
If you instead did a complete shutdown with `docker-compose down`, don't sweat it. This command stops the running containers, then removes the containers (not the images) and the networks defined for the app. Fortunately, because we persisted the data, scripts, logs, and DAGs on our local disk, re-launching the app with `docker-compose up` (1) will not re-initialise the app, and (2) will keep all your data and settings intact.
Full Reset - Take Caution!
To completely reset the app, do a complete shutdown of the app, then delete everything inside the `logs` and `pgdata` folders on your local disk:
# Completely shut down app
docker-compose down
# Remove contents of mounted volumes
rm -rf logs/*
rm -rf pgdata/*
After the full reset, you’ll need to initialise Airflow again before running the app.
Extracting the Data
There are several ways to extract data from the app. First, we can export the tables to a CSV file. We need to (1) find the ID of the running Postgres container, (2) navigate to a folder for storing the CSV files, and (3) run a command in the container to save the data to a file. For (3), we run `psql -U airflow -d optionsdata -c "..."` inside the Postgres container using `docker exec -t`, and write the output to our target file with `... > filename.csv`.
# Check Postgres container ID
docker ps
# Go to folder to store files
cd <directory-to-store-csv-files>
# Extract CSV file
docker exec -t <first-few-characters-of-docker-container> psql -U airflow -d optionsdata -c "COPY table_name to STDOUT WITH CSV HEADER" > "filename.csv"
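Once exported, the file can be loaded back with Python's standard `csv` module. A quick sketch - the column names below are made up for illustration; yours will match your table definition:

```python
import csv
import io

# A small stand-in for the exported file; a real file would be opened with open("filename.csv")
sample = io.StringIO(
    "symbol,strike,bid,ask\n"
    "AAPL,150,1.20,1.25\n"
    "AAPL,155,0.80,0.85\n"
)

rows = list(csv.DictReader(sample))  # each row becomes a dict keyed by the header
print(len(rows))          # 2
print(rows[0]["symbol"])  # AAPL
```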
The second way to extract data is to connect directly to the Postgres instance. Notice that we used the `ports` tag in the Docker Compose file to expose port 5432, so we can access Postgres from our host machine. We use `psycopg2` to establish the connection in Python.
import psycopg2 as pg2
# Connect to database
conn = pg2.connect(host='localhost', database='optionsdata', user='airflow', password='airflow', port='5432')
You can then query the data using the connection object `conn` as you would for any other database.
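Since `psycopg2` follows the DB-API 2.0 protocol, the query pattern can be wrapped in a small helper (my own, written generically so it works with any DB-API connection - swap in the `conn` from above):

```python
def fetch_all(conn, query, params=()):
    """Run a parameterised query and return all rows via the DB-API cursor protocol."""
    cur = conn.cursor()
    try:
        cur.execute(query, params)
        return cur.fetchall()
    finally:
        cur.close()

# With the psycopg2 connection from above (table name is a placeholder;
# psycopg2 uses %s placeholders, while e.g. sqlite3 uses ?):
# rows = fetch_all(conn, "SELECT * FROM table_name LIMIT %s", (10,))
```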
Summary
And there you have it - a data pipeline solution for collecting options data! In this post, we set up our app comprising a PostgreSQL instance and an Airflow instance using Docker Compose. If you’ve followed the series all the way, that’s awesome - you’d have replicated my solution on your own computer. And if you didn’t, you can always clone my Open Options Chains repository and follow the instructions in the Readme.
Credits for image: Kevin Ku on Unsplash