# Read from Auto Loader

## DataBricksAutoLoaderSource

Bases: `SourceInterface`

The Spark Auto Loader is used to read new data files as they arrive in cloud storage. Further information on Auto Loader is available in the [Databricks Auto Loader documentation](https://docs.databricks.com/ingestion/auto-loader/index.html).
### Example

Azure Data Lake Storage Gen2:

```python
from rtdip_sdk.pipelines.sources import DataBricksAutoLoaderSource
from rtdip_sdk.pipelines.utilities import SparkSessionUtility

# Not required if using Databricks
spark = SparkSessionUtility(config={}).execute()

options = {}
path = "abfss://{FILE-SYSTEM}@{ACCOUNT-NAME}.dfs.core.windows.net/{PATH}/{FILE-NAME}"
format = "{DESIRED-FILE-FORMAT}"

DataBricksAutoLoaderSource(spark, options, path, format).read_stream()

# OR

DataBricksAutoLoaderSource(spark, options, path, format).read_batch()
```
AWS S3:

```python
from rtdip_sdk.pipelines.sources import DataBricksAutoLoaderSource
from rtdip_sdk.pipelines.utilities import SparkSessionUtility

# Not required if using Databricks
spark = SparkSessionUtility(config={}).execute()

options = {}
path = "https://s3.{REGION-CODE}.amazonaws.com/{BUCKET-NAME}/{KEY-NAME}"
format = "{DESIRED-FILE-FORMAT}"

DataBricksAutoLoaderSource(spark, options, path, format).read_stream()

# OR

DataBricksAutoLoaderSource(spark, options, path, format).read_batch()
```
Google Cloud Storage:

```python
from rtdip_sdk.pipelines.sources import DataBricksAutoLoaderSource
from rtdip_sdk.pipelines.utilities import SparkSessionUtility

# Not required if using Databricks
spark = SparkSessionUtility(config={}).execute()

options = {}
path = "gs://{BUCKET-NAME}/{FILE-PATH}"
format = "{DESIRED-FILE-FORMAT}"

DataBricksAutoLoaderSource(spark, options, path, format).read_stream()

# OR

DataBricksAutoLoaderSource(spark, options, path, format).read_batch()
```
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `spark` | `SparkSession` | Spark Session required to read data from cloud storage | required |
| `options` | `dict` | Options that can be specified for configuring the Auto Loader. Further information on the available options is in the [Auto Loader options documentation](https://docs.databricks.com/ingestion/auto-loader/options.html) | required |
| `path` | `str` | The cloud storage path | required |
| `format` | `str` | Specifies the file format to be read. Supported formats are listed in the [Databricks Auto Loader documentation](https://docs.databricks.com/ingestion/auto-loader/index.html) | required |
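The `options` dict passes standard Auto Loader `cloudFiles` options through to the reader. As an illustrative sketch (the keys below are standard Auto Loader options, but the values are placeholders):

```python
# Hypothetical values; substitute paths appropriate to your environment.
options = {
    "cloudFiles.schemaLocation": "/tmp/autoloader/schema",  # where the inferred schema is tracked
    "cloudFiles.maxFilesPerTrigger": "1000",  # cap files processed per micro-batch
}
```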
Source code in src/sdk/python/rtdip_sdk/pipelines/sources/spark/autoloader.py
### system_type() *(staticmethod)*
Attributes:

| Name | Type | Description |
|---|---|---|
| `SystemType` | `Environment` | Requires PYSPARK on Databricks |
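Since `system_type` is a static method, the environment requirement can be inspected without constructing the source. A minimal sketch:

```python
from rtdip_sdk.pipelines.sources import DataBricksAutoLoaderSource

# Returns the SystemType enum member indicating that this component
# requires PySpark on Databricks.
print(DataBricksAutoLoaderSource.system_type())
```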
Source code in src/sdk/python/rtdip_sdk/pipelines/sources/spark/autoloader.py
### read_batch()
Raises:

| Type | Description |
|---|---|
| `NotImplementedError` | Auto Loader only supports streaming reads. To perform a batch read, use the read_stream method of this component and specify the Trigger on the write_stream to be `availableNow=True` |
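A minimal sketch of that pattern, reusing the variables from the examples above with a plain Spark `DataStreamWriter` and hypothetical checkpoint/destination paths:

```python
# Stream read, drained as a one-shot "batch" via the availableNow trigger.
df = DataBricksAutoLoaderSource(spark, options, path, format).read_stream()

(
    df.writeStream.trigger(availableNow=True)  # process all available files, then stop
    .option("checkpointLocation", "/tmp/autoloader/checkpoint")  # hypothetical path
    .format("delta")
    .start("/tmp/autoloader/output")  # hypothetical destination
)
```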
Source code in src/sdk/python/rtdip_sdk/pipelines/sources/spark/autoloader.py
### read_stream()
Performs streaming reads of files in cloud storage.
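For instance, reusing the variables from the examples above (a sketch, not a full pipeline):

```python
# Returns a streaming DataFrame; nothing is read until a streaming query starts.
df = DataBricksAutoLoaderSource(spark, options, path, format).read_stream()
df.printSchema()  # inspect the inferred schema before wiring up a sink
```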
Source code in src/sdk/python/rtdip_sdk/pipelines/sources/spark/autoloader.py