Data Pipeline Service

The Data Pipeline Service can be used to start a Data Pipeline that orchestrates several Python scripts on an Apache Spark cluster (run on the IFS.ai Platform in a multi-tenant way). Argo Workflow is used as the Orchestration Engine, which determines how the Spark applications should run under different conditions.

The main functions of the Data Pipeline service in IFS Cloud Web are:

  • Start a Pipeline

  • Get the status of a Pipeline and all related Logs
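
As an illustration, the following Python sketch shows how a client could exercise these two functions over HTTP. The base URL, endpoint paths, payload fields and status values are assumptions made for the sketch, not the documented Data Pipeline Service API.

```python
import time

import requests

# Illustrative only: the base URL, endpoint paths, payload fields and status values
# below are assumptions made for this sketch, not the documented Data Pipeline
# Service API.
BASE_URL = "https://<ifs-cloud-host>/datapipeline"   # hypothetical base URL
HEADERS = {"Authorization": "Bearer <access-token>"}  # hypothetical auth header


def start_pipeline(workflow_name: str) -> str:
    """Start a Data Pipeline run for the given Argo Workflow (hypothetical endpoint)."""
    resp = requests.post(
        f"{BASE_URL}/pipelines",
        json={"workflow": workflow_name},
        headers=HEADERS,
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["pipelineId"]  # hypothetical response field


def get_pipeline_status(pipeline_id: str) -> dict:
    """Fetch the current status and logs of a pipeline run (hypothetical endpoint)."""
    resp = requests.get(f"{BASE_URL}/pipelines/{pipeline_id}", headers=HEADERS, timeout=30)
    resp.raise_for_status()
    return resp.json()


def wait_for_pipeline(pipeline_id: str, poll_seconds: int = 30) -> dict:
    """Poll until the pipeline reaches a terminal phase (phase names mirror Argo's)."""
    while True:
        status = get_pipeline_status(pipeline_id)
        if status.get("phase") in ("Succeeded", "Failed", "Error"):
            return status
        time.sleep(poll_seconds)


# Example usage (workflow name taken from the example later on this page):
# run_id = start_pipeline("esg-kpi-external-data-workflow")
# print(wait_for_pipeline(run_id))
```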

About Argo Workflow and Spark Operator

Data Services use Python scripts to process the data. To execute these Python scripts, an Apache Spark cluster is used. The cluster is created dynamically and terminated once the job is completed.

To create the Spark Cluster, a Spark base image is required.

The Spark base image is used in the Spark application to create the Spark cluster and execute the Python script(s).
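
For illustration, the sketch below expresses a minimal SparkApplication (the Spark Operator custom resource) as a Python dictionary: it is built from a Spark base image and points at the Python script to execute. The image name, namespace, script path and resource sizes are assumptions for the sketch, not values used by the platform.

```python
# A minimal SparkApplication (Spark Operator CRD) expressed as a Python dict. The image
# name, namespace, script path and resource sizes are illustrative assumptions.
import yaml

spark_application = {
    "apiVersion": "sparkoperator.k8s.io/v1beta2",
    "kind": "SparkApplication",
    "metadata": {"name": "esg-kpi-spark-job", "namespace": "data-pipelines"},  # hypothetical names
    "spec": {
        "type": "Python",
        "pythonVersion": "3",
        "mode": "cluster",
        "image": "ifs/spark-base:latest",                               # the Spark base image (assumed tag)
        "mainApplicationFile": "local:///opt/scripts/process_data.py",  # the Python script to execute
        "sparkVersion": "3.5.0",
        "restartPolicy": {"type": "Never"},
        "driver": {"cores": 1, "memory": "2g", "serviceAccount": "spark"},
        "executor": {"cores": 2, "instances": 2, "memory": "4g"},
    },
}

# Serialise it so the next sketch can embed it in an Argo Workflow resource template.
with open("spark_application.yaml", "w") as f:
    yaml.safe_dump(spark_application, f)
```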

These Spark applications must be defined in the Argo Workflow.

The Argo Workflow is managed by the Argo Operator. When a Workflow request is submitted via the Data Pipeline Service, the Argo Operator detects the change (a new request) and creates the Workflow.
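
The sketch below illustrates this flow under stated assumptions: an Argo Workflow with a resource template that creates the SparkApplication from the previous sketch, submitted through the Kubernetes API so that the Argo Operator detects the new resource and runs the Workflow. The namespace and names are hypothetical, and the manifests are illustrative rather than the ones shipped with the Data Pipeline Service.

```python
# A sketch of submitting an Argo Workflow through the Kubernetes API so that the Argo
# Operator (workflow controller) detects the new resource and creates the Workflow.
# The namespace, names and manifest file are illustrative assumptions; wrapping a
# SparkApplication in an Argo resource template is a common pattern, not necessarily
# the exact manifests used by the Data Pipeline Service.
from kubernetes import client, config

# The SparkApplication from the previous sketch, saved as YAML (hypothetical path).
with open("spark_application.yaml") as f:
    spark_app_manifest = f.read()

workflow = {
    "apiVersion": "argoproj.io/v1alpha1",
    "kind": "Workflow",
    "metadata": {"generateName": "esg-kpi-external-data-workflow-"},
    "spec": {
        "entrypoint": "run-spark-job",
        "templates": [
            {
                "name": "run-spark-job",
                # Resource template: Argo creates the SparkApplication and waits until
                # the Spark Operator drives it to a terminal state.
                "resource": {
                    "action": "create",
                    "manifest": spark_app_manifest,
                    "successCondition": "status.applicationState.state == COMPLETED",
                    "failureCondition": "status.applicationState.state == FAILED",
                },
            }
        ],
    },
}

config.load_kube_config()  # or config.load_incluster_config() when running in the cluster
api = client.CustomObjectsApi()
api.create_namespaced_custom_object(
    group="argoproj.io",
    version="v1alpha1",
    namespace="data-pipelines",  # hypothetical namespace
    plural="workflows",
    body=workflow,
)
```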

How a Workload Job Definition Operates:

A Workload Job Definition can be used to start a Data Pipeline and it consists of Parquet Data Sources and Argo Workflows (Actions). A Workload Job Definition can be scheduled or triggered for execution, based on the requirement.

  1. When a defined Workload Job Definition is triggered for a Run (explicit trigger or scheduled), a Workload Run is created. It first initiates loading of the Data Sources. Data is pumped to the relevant Data Lake via the Data Lake service. Once a Data Source is loaded, the relevant Last Refreshed value is updated.

  2. Next, the Data Pipeline service is triggered. This will initiate the Argo Workflow.

    • From IFS Cloud, the Data Pipeline service is invoked where the Workload Action = Workflow (example: esg-kpi-external-data-workflow).
  3. The Data Pipeline service determines the tenant information and passes the Data Lake and connection information to the Workflow.

    Read more about obtaining the Azure Tenant ID.

  4. The Spark Jobs run against the specific Data Lake, so the connection information needs to be accessible from within the Spark scripts to set up the connection. The Workflow passes this information to the script, so the script has what it needs to access the Data Lake (see the sketch after this list).

  5. Then the script is executed.

  6. The Scheduler monitors the status of the Argo Workflow until the process is completed, and then the Workload Run is updated.
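
To make step 4 concrete, the following sketch shows how a PySpark script could receive the Data Lake connection information from the Workflow as command-line arguments, including the Azure Tenant ID, and use the standard Hadoop ABFS OAuth settings to read a Parquet Data Source from Azure Data Lake Storage Gen2. The argument names, paths and credential handling are assumptions; the actual scripts and secret handling used by the platform may differ.

```python
# A minimal sketch of a Python (PySpark) script that receives Data Lake connection
# details from the Workflow as command-line arguments (the actual mechanism and
# credential handling in the Data Pipeline Service may differ) and reads Parquet data
# from Azure Data Lake Storage Gen2. Argument names and paths are assumptions; the
# fs.azure.* keys are standard Hadoop ABFS OAuth properties.
import argparse

from pyspark.sql import SparkSession

parser = argparse.ArgumentParser()
parser.add_argument("--storage-account", required=True)
parser.add_argument("--container", required=True)
parser.add_argument("--tenant-id", required=True)       # Azure Tenant ID (see step 3)
parser.add_argument("--client-id", required=True)
parser.add_argument("--client-secret", required=True)   # in practice, injected as a secret
args = parser.parse_args()

spark = SparkSession.builder.appName("esg-kpi-external-data").getOrCreate()
conf = spark.sparkContext._jsc.hadoopConfiguration()

suffix = f"{args.storage_account}.dfs.core.windows.net"
conf.set(f"fs.azure.account.auth.type.{suffix}", "OAuth")
conf.set(f"fs.azure.account.oauth.provider.type.{suffix}",
         "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
conf.set(f"fs.azure.account.oauth2.client.id.{suffix}", args.client_id)
conf.set(f"fs.azure.account.oauth2.client.secret.{suffix}", args.client_secret)
conf.set(f"fs.azure.account.oauth2.client.endpoint.{suffix}",
         f"https://login.microsoftonline.com/{args.tenant_id}/oauth2/token")

# Read a Parquet Data Source from the Data Lake (path is illustrative).
df = spark.read.parquet(f"abfss://{args.container}@{suffix}/data-sources/example/")
df.show()
spark.stop()
```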