RTDIP Ingestion Pipeline Framework
RTDIP has been built to simplify ingesting and querying time series data. One of the most anticipated features of the Real Time Data Ingestion Platform for 2023 is the ability to create streaming and batch ingestion pipelines according to the requirements of the data source and the needs of the data consumer. Of equal importance is the need to query this data; an article focusing on egress will follow in due course.
Overview
The goals of the RTDIP Ingestion Pipeline framework are to:
- Support Python and PySpark for building pipeline components
- Enable execution of source, transformer, sink/destination and utility components in a framework that runs them in a defined order
- Create modular components that can be leveraged as a step in a pipeline task, using object-oriented programming techniques including interfaces and implementations per component type (a rough sketch follows this list)
- Deploy pipelines to popular orchestration engines
- Ensure pipelines can be constructed and executed using the RTDIP SDK and REST APIs
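As a rough illustration of the interface and implementation approach described above, a source component might be modelled as an abstract base class that each concrete connector implements. The class and method names below are assumptions for illustration only and are not the framework's actual API.

```python
# A minimal sketch of the interface/implementation split for pipeline components.
# All class and method names here are hypothetical, not the framework's actual API.
from abc import ABC, abstractmethod


class SourceInterface(ABC):
    """Contract that every source component is assumed to implement."""

    @abstractmethod
    def read_batch(self):
        """Read a bounded set of records from the source system."""

    @abstractmethod
    def read_stream(self):
        """Return a streaming handle to the source system."""


class CSVFileSource(SourceInterface):
    """Illustrative implementation that reads records from a local CSV file."""

    def __init__(self, path: str):
        self.path = path

    def read_batch(self):
        import csv

        with open(self.path, newline="") as file:
            return list(csv.DictReader(file))

    def read_stream(self):
        raise NotImplementedError("Streaming is not supported for local CSV files")
```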
Pipeline Jobs
The RTDIP Data Ingestion Pipeline Framework will follow the typical job convention that users will be familiar with from orchestration engines such as Apache Airflow or Databricks Workflows.
A pipeline job consists of the following components:
```mermaid
erDiagram
    JOB ||--|{ TASK : contains
    TASK ||--|{ STEP : contains
    JOB {
        string name
        string description
        list task_list
    }
    TASK {
        string name
        string description
        string depends_on_task
        list step_list
        bool batch_task
    }
    STEP {
        string name
        string description
        list depends_on_step
        list provides_output_to_step
        class component
        dict component_parameters
    }
```
As per the above, a pipeline job consists of a list of tasks. Each task consists of a list of steps. Each step consists of a component and a set of parameters that are passed to the component. Dependency Injection will ensure that each component is instantiated with the correct parameters.
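To make this structure concrete, the sketch below models the job, task and step entities with Python dataclasses whose fields mirror the diagram above. The classes and the run_step helper are illustrative assumptions, not the framework's actual objects.

```python
# Hypothetical dataclasses mirroring the JOB/TASK/STEP entities in the diagram above;
# these are not the framework's actual classes.
from dataclasses import dataclass, field
from typing import Optional, Type


@dataclass
class Step:
    name: str
    description: str
    component: Type             # the component class to instantiate for this step
    component_parameters: dict  # injected into the component's constructor
    depends_on_step: list = field(default_factory=list)
    provides_output_to_step: list = field(default_factory=list)


@dataclass
class Task:
    name: str
    description: str
    step_list: list             # list of Step
    batch_task: bool = False
    depends_on_task: Optional[str] = None


@dataclass
class Job:
    name: str
    description: str
    task_list: list             # list of Task


def run_step(step: Step):
    """Dependency injection in miniature: build the component from its parameters."""
    return step.component(**step.component_parameters)
```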
Pipeline Runtime Environments
Pipelines will be able to run in multiple environment types. These will include:
- Python: Components will be written in Python and executed on a Python runtime
- Pyspark: Components will be written in PySpark and executed on an open source Apache Spark runtime
- Databricks: Components will be written in PySpark and executed on a Databricks runtime
- Delta Live Tables: Components will be written in PySpark, executed on a Databricks runtime and will write to Delta Live Tables
Runtimes will take precedence depending on the list of components in a pipeline task (a simple sketch of this selection logic follows the list):
- Pipelines with at least one Databricks or DLT component will be executed in a Databricks environment
- Pipelines with at least one Pyspark component will be executed in a Pyspark environment
- Pipelines with only Python components will be executed in a Python environment
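One way to picture these precedence rules is a small function that inspects the system types declared by a task's components and picks the most demanding runtime. The SystemType enum and select_runtime function below are an illustrative sketch under that assumption, not the real SDK.

```python
# Illustrative sketch of runtime precedence: Databricks > Pyspark > Python.
# The SystemType enum and select_runtime function are assumptions, not the real SDK.
from enum import IntEnum


class SystemType(IntEnum):
    PYTHON = 1
    PYSPARK = 2
    PYSPARK_DATABRICKS = 3


def select_runtime(component_system_types: list) -> SystemType:
    """Return the highest-precedence runtime required by a task's components."""
    return max(component_system_types, default=SystemType.PYTHON)


# Example: a single Databricks component forces the whole task onto a Databricks runtime.
print(select_runtime([SystemType.PYTHON, SystemType.PYSPARK_DATABRICKS]))
# SystemType.PYSPARK_DATABRICKS
```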
Pipeline Clouds
Certain components are related to specific cloud providers, and the tables below indicate which cloud provider a component relates to. This does not mean that the component can only run in that cloud; it simply highlights that the component is related to that cloud provider.
Cloud | Target |
---|---|
Azure | Q1-Q2 2023 |
AWS | Q2-Q4 2023 |
GCP | 2024 |
Pipeline Orchestration
Pipelines will be able to be deployed to orchestration engines so that users can schedule and execute jobs using their preferred engine.
Orchestration Engine | Target |
---|---|
Databricks Workflows | Q2 2023 |
Airflow | Q2 2023 |
Delta Live Tables | Q3 2023 |
Dagster | Q4 2023 |
Pipeline Components
The Real Time Data Ingestion Pipeline Framework will support the following components:
- Sources - connectors to source systems
- Transformers - perform transformations on data, including data cleansing, data enrichment, data aggregation, data masking, data encryption, data decryption, data validation, data conversion, data normalization, data de-normalization, data partitioning, etc.
- Destinations - connectors to sink/destination systems
- Utilities - components that perform utility functions such as logging, error handling, data object creation, authentication, maintenance, etc.
- Edge - components that will perform edge functionality such as connectors to protocols like OPC
Pipeline Component Types
Component types determine the system requirements for executing a component:
- Python - components that are written in Python and can be executed on a Python runtime
- Pyspark - components that are written in PySpark and can be executed on an open source Apache Spark runtime
- Databricks - components that require a Databricks runtime
Sources
Sources are components that connect to source systems and extract data from them. These will typically be real time data sources, but batch components will also be supported, as they remain important and necessary data sources in a number of real-world circumstances.
Source Type | Python | Apache Spark | Databricks | Azure | AWS | Target |
---|---|---|---|---|---|---|
Delta | * | | | | | Q1 2023 |
Delta Sharing | * | | | | | Q1 2023 |
Autoloader | | | | | | Q1 2023 |
Eventhub | * | | | | | Q1 2023 |
IoT Hub | * | | | | | Q2 2023 |
Kafka | | | | | | Q2 2023 |
Kinesis | | | | | | Q2 2023 |
SSIP PI Connector | | | | | | Q2 2023 |
Rest API | | | | | | Q2 2023 |
MongoDB | | | | | | Q3 2023 |
* - target to deliver in the following quarter
There is currently no Spark connector for IoT Core. If you know of a way to add it as a source component, please raise it by creating an issue on the GitHub repo.
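As an indication of what a source component wraps under the hood, the snippet below sketches the batch and streaming reads a Delta source would perform using plain PySpark. It assumes a SparkSession configured with the Delta Lake package and uses a placeholder table path; it is not the RTDIP component API.

```python
# Sketch of what a Delta source component boils down to: a batch or streaming read
# of a Delta table with plain PySpark. Assumes a SparkSession configured with the
# Delta Lake package; the table path below is a placeholder.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rtdip-delta-source-sketch").getOrCreate()

# Batch read of a Delta table
batch_df = spark.read.format("delta").load("/mnt/data/example_delta_table")

# Streaming read of the same table
stream_df = spark.readStream.format("delta").load("/mnt/data/example_delta_table")
```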
Transformers
Transformers are components that perform transformations on data. These will target certain data models and the common transformations that source or destination components require to be performed on data before it can be ingested or consumed.
Transformer Type | Python | Apache Spark | Databricks | Azure | AWS | Target |
---|---|---|---|---|---|---|
Eventhub Body | | | | | | Q1 2023 |
OPC UA | | | | | | Q2 2023 |
OPC AE | | | | | | Q2 2023 |
SSIP PI | | | | | | Q2 2023 |
OPC DA | | | | | | Q3 2023 |
This list will change dynamically as the framework is developed and new components are added.
Destinations
Destinations are components that connect to sink/destination systems and write data to them.
Destination Type | Python | Apache Spark | Databricks | Azure | AWS | Target |
---|---|---|---|---|---|---|
Delta Append | * | | | | | Q1 2023 |
Eventhub | * | | | | | Q1 2023 |
Delta Merge | | | | | | Q2 2023 |
Kafka | | | | | | Q2 2023 |
Kinesis | | | | | | Q2 2023 |
Rest API | | | | | | Q2 2023 |
MongoDB | | | | | | Q3 2023 |
Polygon Blockchain | | | | | | Q3 2023 |
* - target to deliver in the following quarter
Utilities
Utilities are components that perform utility functions such as logging, error handling, data object creation and maintenance. They can normally be executed either as part of a pipeline or standalone; a short sketch of standalone execution follows the table below.
Utility Type | Python | Apache Spark | Databricks | Azure | AWS | Target |
---|---|---|---|---|---|---|
Delta Table Create | * | | | | | Q1 2023 |
Delta Optimize | | | | | | Q2 2023 |
Delta Vacuum | * | | | | | Q2 2023 |
Set ADLS Gen2 ACLs | | | | | | Q2 2023 |
Set S3 ACLs | | | | | | Q2 2023 |
Great Expectations | | | | | | Q3 2023 |
* - target to deliver in the following quarter
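To illustrate the kind of work a standalone utility performs, the sketch below creates, optimizes and vacuums a Delta table with plain Spark SQL. The table name and schema are placeholders, and the eventual RTDIP utility components may expose a different interface.

```python
# Sketch of the work a "Delta Table Create" style utility performs, expressed as
# plain Spark SQL. Assumes a SparkSession with Delta Lake configured; the table
# name and schema are placeholders, and the real component API may differ.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rtdip-delta-utility-sketch").getOrCreate()

spark.sql("""
    CREATE TABLE IF NOT EXISTS example_events (
        TagName STRING,
        EventTime TIMESTAMP,
        Status STRING,
        Value FLOAT
    )
    USING DELTA
""")

# Follow-up maintenance utilities might compact and clean up the same table.
spark.sql("OPTIMIZE example_events")
spark.sql("VACUUM example_events")
```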
Secrets
Secrets are components that perform authentication functions and can normally be executed either as part of a pipeline or standalone.
Secrets Type | Python | Apache Spark | Databricks | Azure | AWS | Target |
---|---|---|---|---|---|---|
Databricks Secrets | | | | | | Q2 2023 |
Hashicorp Vault | | | | | | Q2 2023 |
Azure Key Vault | | | | | | Q3 2023 |
AWS Secrets Manager | | | | | | Q3 2023 |
Edge
Edge components are designed to provide a lightweight, low latency, low resource consumption data ingestion framework for edge devices. These components will be designed to run on edge devices such as Raspberry Pi, Jetson Nano, etc. For cloud providers, this will be designed to run on AWS Greengrass and Azure IoT Edge.
Edge Type | Azure IoT Edge | AWS Greengrass | Target |
---|---|---|---|
OPC CloudPublisher | | | Q3-Q4 2023 |
Fledge | | | Q3-Q4 2023 |
Edge X | | | Q3-Q4 2023 |
Conclusion
This is a very high-level overview of the framework and the components that will be developed. As the framework is open source, the lists and timelines above can change depending on circumstances and resource availability. 2023 is an exciting year for the Real Time Data Ingestion Platform, so check back regularly for updates and new features! If you would like to contribute, please visit our repository on GitHub and connect with us on our Slack channel in the LF Energy Foundation Slack workspace.