# Pipeline Components
## Overview
The Real Time Data Ingestion Pipeline Framework supports the following component types:
- Sources - connectors to source systems
- Transformers - perform transformations on data, such as cleansing, enrichment, aggregation, masking, encryption, decryption, validation, conversion, normalization, de-normalization and partitioning
- Destinations - connectors to sink/destination systems
- Utilities - components that perform utility functions such as logging, error handling, data object creation, authentication and maintenance
- Secrets - components that facilitate access to secret stores where sensitive information such as passwords, connection strings and keys is stored
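To illustrate how these component types fit together, the minimal sketch below chains a source, a transformer and a destination. All class names here are hypothetical placeholders for illustration, not actual RTDIP components:

```python
class ExampleSource:
    """Hypothetical source: connects to a source system and extracts data."""

    def read_batch(self):
        # A real component would connect to the source system;
        # static rows stand in for extracted data here.
        return [{"tag": "sensor_1", "value": 42.0}]


class ExampleTransformer:
    """Hypothetical transformer: reshapes rows into a target data model."""

    def __init__(self, data):
        self.data = data

    def transform(self):
        return [{"TagName": r["tag"], "Value": r["value"]} for r in self.data]


class ExampleDestination:
    """Hypothetical destination: writes rows to a sink system."""

    def __init__(self, data):
        self.data = data

    def write_batch(self):
        print(f"writing {len(self.data)} rows")


rows = ExampleSource().read_batch()
modelled = ExampleTransformer(rows).transform()
ExampleDestination(modelled).write_batch()
```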
## Component Types
Component types determine the system requirements needed to execute a component:
- Python - components that are written in Python and can be executed on a Python runtime
- PySpark - components that are written in PySpark and can be executed on an open source Apache Spark runtime
- Databricks - components that require a Databricks runtime
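As a sketch of how a component might declare its required runtime, consider the following. The enum and method names are assumptions for illustration, not necessarily the framework's exact API:

```python
from enum import Enum


# Assumed sketch: an enum of runtimes and a component that declares which
# runtime it needs. The names below are illustrative, not the SDK's exact API.
class SystemType(Enum):
    PYTHON = 1              # plain Python runtime
    PYSPARK = 2             # open source Apache Spark runtime
    PYSPARK_DATABRICKS = 3  # Databricks runtime


class DeltaDestination:  # hypothetical component
    @staticmethod
    def system_type() -> SystemType:
        # writing to Delta requires Spark, so the component declares PYSPARK
        return SystemType.PYSPARK


if DeltaDestination.system_type() is not SystemType.PYTHON:
    print("Schedule this component on a Spark-capable cluster")
```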
Note: RTDIP is continuously adding to this list. For detailed information on timelines, read this blog post and check back on this page regularly.
## Sources
Sources are components that connect to source systems and extract data from them. These are typically real time data sources, but batch sources are also supported, as they remain important and necessary sources of time series data in many real-world circumstances.
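As an illustration of what a source component wraps, the sketch below uses plain PySpark's built-in Kafka connector rather than an RTDIP class; the broker address and topic are placeholders, and the Kafka connector package must be on the Spark classpath:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rtdip-source-example").getOrCreate()

# Subscribe to a Kafka topic and obtain a streaming DataFrame of raw events.
raw_df = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # placeholder address
    .option("subscribe", "sensor-readings")            # placeholder topic
    .load()
)
# raw_df exposes binary key/value columns for a transformer to decode
```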
Note: This list will dynamically change as the framework is further developed and new components are added.
## Transformers
Transformers are components that perform transformations on data. They target specific data models and the common transformations that source or destination components require before data can be ingested or consumed.
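Continuing the source sketch above, a typical transformation decodes the raw binary payload and normalises it into a typed schema. This is plain PySpark, with an illustrative schema standing in for a real data model:

```python
from pyspark.sql import functions as F
from pyspark.sql.types import (
    DoubleType, StringType, StructField, StructType, TimestampType,
)

# Target schema for the time series data model (illustrative field names).
schema = StructType([
    StructField("TagName", StringType()),
    StructField("EventTime", TimestampType()),
    StructField("Value", DoubleType()),
])

# Decode the binary Kafka payload to a string, parse it as JSON and
# flatten the parsed struct into typed columns.
events_df = (
    raw_df.select(F.col("value").cast("string").alias("body"))
    .select(F.from_json("body", schema).alias("event"))
    .select("event.*")
)
```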
Note: This list will dynamically change as the framework is further developed and new components are added.
## Destinations
Destinations are components that connect to sink/destination systems and write data to them.
Destination Type | Python | Apache Spark | Databricks | Azure | AWS |
---|---|---|---|---|---|
Delta | |||||
Delta Merge | |||||
Eventhub | |||||
Kafka | |||||
Eventhub Kafka | |||||
Kinesis | |||||
Rest API | |||||
Process Control Data Model To Delta | |||||
Process Control Data Model Latest Values To Delta | |||||
EVM |
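As an illustration of the Delta destination above, the following continues the transformer sketch using plain PySpark and Delta Lake APIs rather than RTDIP component classes; the table name and checkpoint path are placeholders:

```python
# Append the transformed stream to a Delta table, tracking progress with a
# streaming checkpoint so the write can resume after a restart.
query = (
    events_df.writeStream.format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/tmp/checkpoints/sensor-readings")  # placeholder path
    .toTable("sensor_readings")
)
```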
Note: This list will dynamically change as the framework is further developed and new components are added.
## Utilities
Utilities are components that perform utility functions such as logging, error handling, data object creation, authentication and maintenance. They can normally be executed as part of a pipeline or standalone.
Utility Type | Python | Apache Spark | Databricks | Azure | AWS |
---|---|---|---|---|---|
Spark Session | |||||
Spark Configuration | |||||
Delta Table Create | |||||
Delta Table Optimize | |||||
Delta Table Vacuum | |||||
AWS S3 Bucket Policy | |||||
AWS S3 Copy | |||||
ADLS Gen 2 ACLs | |||||
Azure Autoloader Resources | |||||
Spark ADLS Gen 2 Service Principal Connect |
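For instance, the Delta Table Optimize and Delta Table Vacuum utilities perform routine table maintenance. The plain Spark SQL below shows the equivalent operations, assuming a Delta-enabled Spark session; the table name and retention period are placeholders:

```python
# Routine Delta maintenance of the kind the Optimize and Vacuum utility
# components automate (requires a Delta-enabled Spark session).
spark.sql("OPTIMIZE sensor_readings")                 # compact small files
spark.sql("VACUUM sensor_readings RETAIN 168 HOURS")  # drop files older than 7 days
```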
Note: This list will dynamically change as the framework is further developed and new components are added.
## Secrets
Secrets are components that interact with secret stores to manage sensitive information such as passwords, keys and certificates.
Secret Type | Python | Apache Spark | Databricks | Azure | AWS |
---|---|---|---|---|---|
Databricks Secret Scopes | |||||
HashiCorp Vault | |||||
Azure Key Vault |
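As an illustration of what a secrets component abstracts, the sketch below fetches a connection string from Azure Key Vault using the standard Azure SDK for Python rather than an RTDIP component; the vault URL and secret name are placeholders:

```python
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

# Fetch a connection string from the secret store at runtime instead of
# hard-coding it in pipeline configuration.
client = SecretClient(
    vault_url="https://<your-vault>.vault.azure.net",  # placeholder vault URL
    credential=DefaultAzureCredential(),
)
connection_string = client.get_secret("eventhub-connection-string").value
```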
Note: This list will dynamically change as the framework is further developed and new components are added.
## Conclusion
Components can be used to build RTDIP Pipelines, a process described in more detail here.