1. Installation
source: Quick Start
1.1. Core Airflow
Before installing Airflow, we need to set its home directory, where the complete setup will live.
Install the python3-venv package using sudo apt install python3-venv.
Once installed, set the home directory for Airflow to the current Airflow project directory.
Otherwise, Airflow uses ~/airflow as the home directory by default. If you have a previous installation of Airflow in that directory, you can point Airflow to another directory by changing the AIRFLOW_HOME environment variable, which the installation reads to decide where to place its files.
cd to the Airflow project root directory and run
export AIRFLOW_HOME=$PWD
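The lookup described above can be sketched in Python as follows. This is only an illustration of the rule (use AIRFLOW_HOME if it is set, otherwise fall back to ~/airflow); the function name resolve_airflow_home is hypothetical and is not part of Airflow.

```python
import os
from pathlib import Path


def resolve_airflow_home(env=None):
    """Return the directory Airflow would treat as its home.

    Hypothetical helper for illustration: it mirrors the rule that
    AIRFLOW_HOME wins when set and ~/airflow is used otherwise.
    """
    env = os.environ if env is None else env
    return Path(env.get("AIRFLOW_HOME", "~/airflow")).expanduser()


# With the variable exported, the project directory is used.
print(resolve_airflow_home({"AIRFLOW_HOME": "/tmp/airflow_project"}))
```

Calling resolve_airflow_home() with no argument reads the real environment, which is a quick way to check that the export took effect in the current shell.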
Once the airflow home directory is set, create a virtual environment inside the root directory.
python3 -m venv .airflow_venv
# activate the virtual environment
source ./.airflow_venv/bin/activate
# upgrade pip
pip install --upgrade pip
Once inside the virtual environment, we can install Airflow using the following commands.
AIRFLOW_VERSION=3.2.0
PYTHON_VERSION="$(python --version | cut -d " " -f 2 | cut -d "." -f 1-2)"
pip install "apache-airflow==${AIRFLOW_VERSION}" --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-${AIRFLOW_VERSION}/constraints-${PYTHON_VERSION}.txt"
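The constraint URL in the command above is assembled from the Airflow version and the major.minor version of the running Python interpreter. A minimal Python sketch of the same string construction is shown here; the function name constraints_url is hypothetical, and the URL pattern is copied from the pip command above.

```python
import sys


def constraints_url(airflow_version, python_version=None):
    """Build the constraints file URL used in the pip command.

    python_version defaults to the major.minor version of the running
    interpreter, which is what the shell pipeline with `cut` extracts.
    """
    if python_version is None:
        python_version = f"{sys.version_info.major}.{sys.version_info.minor}"
    return (
        "https://raw.githubusercontent.com/apache/airflow/"
        f"constraints-{airflow_version}/constraints-{python_version}.txt"
    )


print(constraints_url("3.2.0"))
```

Pinning against a constraints file keeps the dependency versions to a set that was tested with that Airflow release and Python version.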
Once installed, verify the installation using airflow version.
To start using Airflow, we need to set up the database, create a user, and spin up components such as the api-server, scheduler, etc.
The airflow standalone command performs all these steps automatically.
Here, however, we perform all the setup manually.
- Create the database using the command airflow db migrate.
- Once the database is created, run the API server using airflow api-server --port 8080. This spins up the Airflow server. When the server starts, a password is auto-generated for each user, and a file named simple_auth_manager_passwords.json.generated is created in the Airflow home to store each user's username and password.
- Go to the host URL specified in the output of the previous command, say http://0.0.0.0:8080, in your browser and enter the login details. The Airflow web UI should load.
- Next, we need to start three separate processes, viz., scheduler, dag-processor, and triggerer. In the web UI, they appear as unhealthy under the Health section until they are started. Start these processes:
airflow scheduler
airflow dag-processor
airflow triggerer
Run these commands in separate terminals.
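The health status shown in the web UI can also be checked from a script. The sketch below assumes that the API server exposes a JSON health report at /api/v2/monitor/health and that each component entry carries a "status" field; both the path and the payload shape are assumptions here and should be checked against the API reference of the installed version. The function names are hypothetical.

```python
import json
from urllib.request import urlopen

# Assumption: path of the health report on the Airflow API server.
HEALTH_PATH = "/api/v2/monitor/health"


def unhealthy_components(report):
    """Return the names of components whose status is not 'healthy'."""
    return sorted(
        name
        for name, info in report.items()
        if (info or {}).get("status") != "healthy"
    )


def fetch_health(base_url="http://localhost:8080"):
    """Download and decode the health report from a running API server."""
    with urlopen(base_url + HEALTH_PATH, timeout=10) as response:
        return json.load(response)


# Example with a hand-written report (assumed shape), so that no server is needed.
sample = {
    "metadatabase": {"status": "healthy"},
    "scheduler": {"status": "healthy"},
    "triggerer": {"status": None},
    "dag_processor": {"status": "unhealthy"},
}
print(unhealthy_components(sample))
```

With the API server running, unhealthy_components(fetch_health()) should return an empty list once the scheduler, dag-processor, and triggerer have all been started.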
1.2. Providers
The pip install command in the previous section installs only the core Airflow scheduling functionality, without the full set of providers that used to come with it before version 3.x.
Providers are packages that allow Airflow to interface with external services such as Google Cloud, the OpenAI API, etc.
They can be installed separately to extend the functionality of Airflow.
They are named apache-airflow-providers-<name>. For example, the cloud service providers are
- apache-airflow-providers-amazon
- apache-airflow-providers-google
These providers also have corresponding extras that can be used to install them together with Airflow, as in pip install "apache-airflow[google,amazon]" ....
For AI integration, one can install the following providers.
- apache-airflow-providers-common-ai
- apache-airflow-providers-openai
Or, for databases:
- apache-airflow-providers-postgres
- apache-airflow-providers-mysql
- apache-airflow-providers-sqlite
- apache-airflow-providers-qdrant
- apache-airflow-providers-pinecone
For HTTP:
- apache-airflow-providers-http
The full list of providers is available in the Operators and Hooks Reference.
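To see which providers are present in the active virtual environment, the installed distributions can be filtered by the apache-airflow-providers- prefix. The sketch below uses only the standard library; the helper names are hypothetical. The airflow providers list command gives similar information from the CLI.

```python
from importlib.metadata import distributions

PROVIDER_PREFIX = "apache-airflow-providers-"


def provider_names(dist_names):
    """Keep only the distribution names that look like Airflow providers."""
    return sorted(
        {name for name in dist_names if name and name.lower().startswith(PROVIDER_PREFIX)}
    )


def installed_providers():
    """List provider packages installed in the current environment."""
    return provider_names(dist.metadata["Name"] for dist in distributions())


print(installed_providers())
```

In an environment without Airflow the printed list is simply empty.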
2. Introduction
Airflow is a workflow orchestration tool. It automates any workflow that consists of multiple processes working in sequence. The core components of Airflow are DAGs and Tasks. A DAG, which stands for Directed Acyclic Graph, is a representation of the workflow. The name is descriptive: any workflow that follows a particular order, and whose execution order does not form closed cycles, can be automated in Airflow. Tasks are the nodes of the DAG. They represent the actual processes that make up the workflow.
3. DAG
A DAG is an abstraction of the execution order of processes. In many scenarios, the processes follow a certain order and do not repeat in a cycle. Consider the classic example of Extract, Transform, and Load (ETL). In the simplest case, the data is extracted from a source, the extracted data is transformed, and then loaded into storage. The three tasks E, T, and L occur in sequence, and the sequence of their execution can be represented by a linear DAG.
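The ordering idea can be illustrated without Airflow at all. The sketch below uses graphlib from the Python standard library (not Airflow's own DAG class) to compute the execution order of the linear ETL graph and to show that a graph with a closed cycle is rejected. The task names and helper functions are illustrative.

```python
from graphlib import CycleError, TopologicalSorter

# Each key maps a task to the set of tasks that must finish before it.
etl = {"extract": set(), "transform": {"extract"}, "load": {"transform"}}


def execution_order(graph):
    """Return one valid execution order for the graph."""
    return list(TopologicalSorter(graph).static_order())


def is_acyclic(graph):
    """Return True if the graph has no closed cycle, so an order exists."""
    try:
        execution_order(graph)
    except CycleError:
        return False
    return True


print(execution_order(etl))  # ['extract', 'transform', 'load']

# Making extract depend on load closes the loop, so no valid order exists.
cyclic = {**etl, "extract": {"load"}}
print(is_acyclic(cyclic))  # False
```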
3. Tasks
Tasks are the atomic units of execution that sit at the nodes of a DAG.
Operators are templates that define what such a unit does when the DAG runs.
Airflow provides various operators out of the box.
One such operator is PythonOperator, which takes a callable Python function and executes it according to the schedule of the DAG it belongs to.
A task is an instance of an operator. One can define tasks either by instantiating operators or by decorating a Python function with the @task() decorator.
3.1 Operators
3.2 TaskFlow
3.3 Sensors
3.4 Deferrable Operators and Triggers
Operators do not run all the time. Often, they need to wait for some other process to finish before they can continue executing. However, even though the task is not actually executing, it occupies a slot on the worker node. As more and more tasks occupy worker slots, the worker may run out of slots for new tasks. Deferrable operators are designed to overcome this issue.
Deferrable operators leave the worker node once they enter the idle state, instead of sitting on the worker while waiting for the processes they depend on to finish. This leaves room for other tasks on the worker node. A deferrable operator is tied to a trigger. Each trigger is bound to a specific event, which it fires once certain conditions are met. When the task becomes idle, execution shifts to the triggerer. Triggers are lightweight pieces of Python code that run