GCS Reader
Google Cloud Storage (GCS) is a web service that provides unified file storage for storing and accessing customer data. You can use GCS Reader to read data from a Google Cloud Storage bucket.
Summary
APIs used/data supported |
|
Supported parsers | AAL (Apache access log), Avro, Binary, DSV, Free Form Text, JSON, NVP (name-value pair), Parquet, XML |
Supported targets | All targets supported by Striim. See Writers overview. |
Security and authentication | GCS Reader supports private endpoints. See Using Private Service Connect with Google Cloud adapters. Access to Google Cloud Storage requires:
|
Operations / modes supported |
|
Modes for fetching data |
|
Resilience / recovery |
|
Performance |
|
Programmability |
|
Metrics and auditing | Key metrics available through Striim monitoring. See Monitoring metrics. |
Key limitations | No support for reading encrypted GCS objects. |
Typical use case and integration
A typical use case is using Google Cloud Storage as a centralized repository that stores, processes, and secures files of any format. Striim is used to process and enrich the newly ingested data, and send the data to data warehouses for analytics.
GCS Reader overview
Files stored in Google Cloud Storage are grouped into buckets. Buckets are the basic containers that hold your data. Everything that you store in Cloud Storage must be contained in a bucket. You can use buckets to organize and control access to your data.
GCS Reader works by connecting to Google Cloud Storage and fetching file metadata from the bucket that you specify. The reader requires a valid service account key (a JSON credentials file) to access the files in the GCS bucket. GCS Reader supports different file formats, such as JSON and CSV. It can read from a single folder, or read recursively from a folder and its subfolders.
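The difference between reading a single folder and reading recursively can be illustrated with a small sketch. It does not call Google Cloud Storage or reproduce GCS Reader's implementation. It assumes only that object names use `/` as a folder separator, and the function name `list_objects` and the sample object names are hypothetical.

```python
from typing import Iterable, List


def list_objects(names: Iterable[str], folder: str, recursive: bool) -> List[str]:
    """Return the object names under `folder`, optionally including subfolders."""
    prefix = folder.rstrip("/") + "/" if folder else ""
    selected = []
    for name in names:
        if not name.startswith(prefix):
            continue  # object lives outside the requested folder
        remainder = name[len(prefix):]
        if not remainder or remainder.endswith("/"):
            continue  # skip folder placeholder entries
        if not recursive and "/" in remainder:
            continue  # object lives in a subfolder
        selected.append(name)
    return sorted(selected)


# Hypothetical bucket contents, used only for illustration.
bucket = ["logs/a.json", "logs/2024/b.json", "logs/2024/01/c.csv", "other/d.json"]

print(list_objects(bucket, "logs", recursive=False))
print(list_objects(bucket, "logs", recursive=True))
```

With `recursive=False` only objects directly inside the folder are selected. With `recursive=True` objects at any depth below it are included.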
The reader processes the files in the bucket as an initial load or as incremental loads. There are two modes for file detection. The GCSDirectoryListing mode performs a full metadata fetch when the reader starts and again for every subsequent polling fetch. The GCSAuditLogNotification mode performs a full metadata fetch when the reader starts, and subsequent polling calls fetch only the incremental changes from the audit log generated by the GCS bucket.
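The two detection modes can be contrasted with a simplified in-memory model. This is a sketch under stated assumptions, not GCS Reader's actual logic: the bucket is modeled as a dictionary of object name to a hypothetical version number, the audit log as a list of `(sequence, object name)` entries, and both function names are invented for illustration.

```python
from typing import Dict, List, Tuple

Metadata = Dict[str, int]  # object name -> version number (hypothetical model)


def directory_listing_poll(bucket: Metadata, seen: Metadata) -> List[str]:
    """Fetch the full metadata on every poll and diff it against the last snapshot."""
    new_or_changed = [name for name, version in sorted(bucket.items())
                      if seen.get(name) != version]
    seen.clear()
    seen.update(bucket)  # remember the full snapshot for the next poll
    return new_or_changed


def audit_log_poll(audit_log: List[Tuple[int, str]], cursor: int) -> Tuple[List[str], int]:
    """Read only the audit entries recorded after `cursor`; return names and new cursor."""
    fresh = [(seq, name) for seq, name in audit_log if seq > cursor]
    names = [name for _, name in fresh]
    new_cursor = max((seq for seq, _ in fresh), default=cursor)
    return names, new_cursor


# Directory-listing style: every poll walks the whole bucket.
bucket = {"a.json": 1, "b.json": 1}
seen: Metadata = {}
print(directory_listing_poll(bucket, seen))  # initial load sees every object
bucket["c.json"] = 1
print(directory_listing_poll(bucket, seen))  # later poll reports only the change

# Audit-log style: after the initial load, polls read only new log entries.
audit_log = [(1, "a.json"), (2, "b.json"), (3, "c.json")]
print(audit_log_poll(audit_log, cursor=2))
```

In this model the cost of a directory-listing poll grows with the number of objects in the bucket, while an audit-log poll grows only with the number of changes since the last poll.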
GCS Reader supports two ways to fetch the data. In the default streaming mode, GCS Reader fetches the file data directly from Google Cloud Storage by opening a remote InputStream, streaming the bytes from the remote file, and processing the incoming stream. Alternatively, you can use the download mode where GCS Reader first downloads the files to a local folder and then processes them.
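The two fetch modes can be sketched as follows. This is an illustration only: an in-memory `io.BytesIO` stands in for the remote stream, and the names `process_lines`, `read_streaming`, and `read_via_download` are hypothetical rather than part of Striim. Following the documented behavior, the download variant removes the local copy once the file has been processed successfully.

```python
import io
import shutil
import tempfile
from pathlib import Path
from typing import BinaryIO, Callable, List

OpenRemote = Callable[[], BinaryIO]  # hypothetical factory for a remote byte stream


def process_lines(stream: BinaryIO) -> List[str]:
    """Hypothetical parser: treat each line of the stream as one record."""
    return [raw.decode("utf-8").rstrip("\r\n") for raw in stream]


def read_streaming(open_remote: OpenRemote) -> List[str]:
    """Streaming mode: process bytes as they arrive from the remote stream."""
    with open_remote() as remote:
        return process_lines(remote)


def read_via_download(open_remote: OpenRemote, download_dir: Path, name: str) -> List[str]:
    """Download mode: copy the whole file locally first, then process it."""
    local_path = download_dir / name
    with open_remote() as remote, open(local_path, "wb") as local:
        shutil.copyfileobj(remote, local)
    with open(local_path, "rb") as local:
        records = process_lines(local)
    local_path.unlink()  # delete the local copy only after successful processing
    return records


PAYLOAD = b"id,name\n1,alpha\n2,beta\n"


def fake_remote() -> BinaryIO:
    """Stand-in for opening a remote object; returns a fresh in-memory stream."""
    return io.BytesIO(PAYLOAD)


print(read_streaming(fake_remote))
with tempfile.TemporaryDirectory() as tmp:
    print(read_via_download(fake_remote, Path(tmp), "data.csv"))
```

Both paths produce the same records. The difference is whether the complete file exists on local disk before processing begins.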
In download mode, after a file has been processed successfully, Striim deletes the downloaded copy from local storage. The download mode is recommended only when an entire file must be available before processing can begin. Currently this recommendation applies only to Parquet files, because Parquet stores its schema in the file footer. In the event of a system crash, the whole Striim application halts. When the application restarts, the contents of the local download folder are cleared and unprocessed files are downloaded again.
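The restart behavior described above can be modeled roughly as follows. This is a sketch under assumptions, not Striim's recovery code: the set of already processed file names is assumed to be tracked elsewhere, the download folder is assumed to be flat, and the function name `prepare_restart` is hypothetical.

```python
import tempfile
from pathlib import Path
from typing import Iterable, List, Set


def prepare_restart(download_dir: Path, all_files: Iterable[str], processed: Set[str]) -> List[str]:
    """Clear leftover local copies and return the files that must be downloaded again."""
    for entry in download_dir.iterdir():
        if entry.is_file():
            entry.unlink()  # partial or stale downloads are discarded
    return [name for name in all_files if name not in processed]


with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    # Simulate a half-downloaded file left behind by a crash.
    (folder / "b.parquet").write_bytes(b"partial")
    pending = prepare_restart(folder, ["a.parquet", "b.parquet", "c.parquet"], {"a.parquet"})
    print(pending)
    print(list(folder.iterdir()))
```

Discarding every local copy on restart avoids having to decide whether a leftover file was fully downloaded before the crash.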