RTDIP Ingestion Pipeline Framework
RTDIP has been built to simplify ingesting and querying time series data. The RTDIP Ingestion Pipeline Framework creates streaming and batch ingestion pipelines according to the requirements of the data source and the needs of the data consumer. RTDIP Pipelines focuses on the ingestion of data into the platform.
Prerequisites
Ensure that you have followed the installation instructions as specified in the Getting Started section, including the steps that highlight the installation requirements for Pipelines. In particular:
- RTDIP SDK Installation
- Java - If your pipeline steps use PySpark, Java must be installed.
RTDIP SDK Installation
Ensure you have installed the RTDIP SDK, as a minimum, as follows:
pip install "rtdip-sdk[pipelines]"
For all installation options please see the RTDIP SDK installation instructions.
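As a quick post-install sanity check, a short script along the following lines can confirm that the SDK imports and, if your pipeline will include PySpark-based steps, that a Java runtime is available on the PATH. This is a minimal illustrative sketch, not part of the RTDIP SDK.

```python
# Illustrative post-install check; not part of the RTDIP SDK itself.
import importlib.util
import shutil

# Confirm the RTDIP SDK is importable in the current environment.
if importlib.util.find_spec("rtdip_sdk") is None:
    raise SystemExit('rtdip-sdk is not installed; run: pip install "rtdip-sdk[pipelines]"')

# PySpark-based pipeline steps additionally require a Java runtime on the PATH.
if shutil.which("java") is None:
    print("Warning: Java was not found on the PATH; PySpark components will not run.")
else:
    print("RTDIP SDK is installed and Java is available.")
```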
Overview
The goal of the RTDIP Ingestion Pipeline framework is to:
- Support Python and PySpark to build pipeline components
- Enable source, transformer, destination and utility components to be executed in a framework that runs them in a defined order
- Create modular components that can be leveraged as a step in a pipeline task, using Object Oriented Programming techniques including Interfaces and Implementations per component type
- Deploy pipelines to popular orchestration engines
- Ensure pipelines can be constructed and executed using the RTDIP SDK and REST APIs
Jobs
The RTDIP Data Ingestion Pipeline Framework follows the typical convention of a job that users will be familiar with if they have used orchestration engines such as Apache Airflow or Databricks Workflows.
A pipeline job consists of the following components:
erDiagram
JOB ||--|{ TASK : contains
TASK ||--|{ STEP : contains
JOB {
string name
string description
list task_list
}
TASK {
string name
string description
string depends_on_task
list step_list
bool batch_task
}
STEP {
string name
string description
list depends_on_step
list provides_output_to_step
class component
dict component_parameters
}
As per the above, a pipeline job consists of a list of tasks. Each task consists of a list of steps. Each step consists of a component and a set of parameters that are passed to the component. Dependency Injection will ensure that each component is instantiated with the correct parameters.
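To make the hierarchy above concrete, the sketch below models a job, task and step with Python dataclasses. The class and field names simply mirror the diagram; they are illustrative only and are not the RTDIP SDK's actual classes.

```python
# Illustrative dataclass model of the job/task/step hierarchy shown in the diagram.
# These are not the RTDIP SDK's classes; they only mirror the diagram's fields.
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Step:
    name: str
    description: str
    component: type             # pipeline component class to instantiate
    component_parameters: dict  # parameters injected into the component
    depends_on_step: list[str] = field(default_factory=list)
    provides_output_to_step: list[str] = field(default_factory=list)


@dataclass
class Task:
    name: str
    description: str
    step_list: list[Step]
    batch_task: bool = True
    depends_on_task: str | None = None


@dataclass
class Job:
    name: str
    description: str
    task_list: list[Task]
```

At execution time, dependency injection instantiates each step's component with its component parameters before the steps are run in their declared order.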
More information about Pipeline Jobs can be found here.
Runtime Environments
Note
RTDIP is continuously adding more to this list. For detailed information on timelines, read this blog post and check back on this page regularly.
Pipelines can run in multiple environment types. These include:
- Python: Components written in Python and executed on a Python runtime
- PySpark: Components written in PySpark and executed on an open source Apache Spark runtime
- Databricks: Components written in PySpark and executed on a Databricks runtime
- Delta Live Tables: Components written in PySpark and executed on a Databricks Delta Live Tables runtime
The runtime is selected by order of precedence, based on the components in a pipeline task, as illustrated in the sketch after this list:
- Pipelines with at least one Databricks or DLT component will be executed in a Databricks environment
- Pipelines with at least one Pyspark component will be executed in a Pyspark environment
- Pipelines with only Python components will be executed in a Python environment
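A simple way to express this precedence is sketched below. The SystemType names are assumptions made for illustration, not taken from the SDK, and Delta Live Tables is treated here as part of the Databricks runtime for simplicity.

```python
# Illustrative runtime selection by precedence: Databricks > PySpark > Python.
# The SystemType values are assumed for this sketch, not taken from the SDK.
from enum import Enum, auto


class SystemType(Enum):
    PYTHON = auto()
    PYSPARK = auto()
    PYSPARK_DATABRICKS = auto()


def select_runtime(component_system_types: list) -> SystemType:
    """Pick the task runtime from the system types of its components."""
    if SystemType.PYSPARK_DATABRICKS in component_system_types:
        return SystemType.PYSPARK_DATABRICKS
    if SystemType.PYSPARK in component_system_types:
        return SystemType.PYSPARK
    return SystemType.PYTHON
```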
Conclusion
Find out more about the components that can be used by the RTDIP Ingestion Pipeline Framework here.