GCS Reader runtime considerations
Monitoring metrics
The following monitoring metrics are published by the GCS Reader:
Count of cloud objects metadata fetched: The number of object metadata entries retrieved by the last fetch.
External I/O latency: The latency of the last metadata fetch call.
Name of the last cloud objects metadata fetched.
Cloud object statistics:
Count of cloud objects metadata fetched: Total number of object metadata entries fetched so far.
Downloaded count: Number of files downloaded.
Processed count: Number of files processed.
Missing count: Number of files deleted from the bucket after their metadata was fetched.
Total object size in MB: Total size in MB of all objects whose metadata has been fetched so far.
Total downloaded size in MB: Total size in MB of all downloaded objects. This metric is not published for the UseStreaming option.
Disk utilization in MB: Current disk utilization of the download directory (.striim/componentname/).
Current filename.
Last file read.
Performance optimizations
Object fetching mode: The streaming approach (Use Streaming property) is expected to be faster than the download approach, because the bytes are streamed directly instead of requiring additional download steps. In local testing with sample data of 411 DSV files of varying size containing 1M events in total, the download approach took 162 seconds and the streaming approach took 55 seconds.
Object detection mode: The GCSAuditLogNotification object detection mode provides better performance during app recovery after a crash/stop when a bucket contains a huge number (on the order of millions) of objects, because the reader does not need to fetch the full metadata to locate the checkpointed object.
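As a rough worked check of the timing figures quoted for the two object fetching modes, the short Python sketch below derives the speedup and the average throughput of each mode. Only the three inputs (1M events, 162 seconds, 55 seconds) come from the local test described above; the variable names are illustrative, and the results apply to that sample only.

```python
# Figures from the local test described above (411 DSV files, 1M events in total).
TOTAL_EVENTS = 1_000_000
DOWNLOAD_SECONDS = 162
STREAMING_SECONDS = 55

# Ratio of elapsed times: how many times faster streaming was on this sample.
speedup = DOWNLOAD_SECONDS / STREAMING_SECONDS

# Average number of events handled per second in each mode.
download_rate = TOTAL_EVENTS / DOWNLOAD_SECONDS
streaming_rate = TOTAL_EVENTS / STREAMING_SECONDS

print(f"speedup: {speedup:.2f}x")
print(f"download: {download_rate:.0f} events/s")
print(f"streaming: {streaming_rate:.0f} events/s")
```

On this sample the streaming approach was roughly three times faster. Results for other file counts, file sizes, or environments may differ.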
Limitations
The following limitations apply to the GCS Reader:
The GCS Reader can read Avro files with an embedded schema, but not with a separate Avro schema file.
The GCS Reader adapter's download mode is not supported on Windows OS.
If the object name is longer than the maximum filename length supported by the current OS, enable the Use Streaming option to avoid exceptions caused by downloading a file whose name exceeds that limit.
If a bucket contains a huge number of objects, the reader may consume a high level of memory and CPU to fetch and process the metadata. This applies to both the GCSDirectoryListing and GCSAuditLogNotification modes.
For the GCSDirectoryListing mode, a full metadata fetch happens when the adapter starts and for every subsequent polling fetch.
For the GCSAuditLogNotification mode, a full metadata fetch happens when the adapter starts, and subsequent polling calls fetch only the incremental changes from the audit log.
In GCSDirectoryListing mode, if the bucket contains a huge number (on the order of millions) of objects, app recovery after a crash/stop will take considerable time, since the full metadata has to be fetched to locate the checkpointed object. The GCSAuditLogNotification mode is recommended for better performance.
In the GCSAuditLogNotification mode, Google Cloud applies a default limit of 60 requests per minute on reading the audit log. If you are running multiple apps, set the polling interval based on the number of apps you are running and the audit log read limit.
A time offset of 5 minutes is applied to queries to avoid a conflict during high-volume data loading. To modify the 5-minute default, contact Striim support.
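The guidance on choosing a polling interval when several apps read the audit log can be sketched as a small calculation. This is a minimal sketch, assuming that every app polls at the same interval, that each poll issues exactly one audit log read request unless stated otherwise, and that all apps share the same limit of 60 requests per minute. The function name `min_polling_interval_seconds` and its parameters are hypothetical and are not part of Striim.

```python
import math


def min_polling_interval_seconds(app_count, requests_per_minute_limit=60, requests_per_poll=1):
    """Return the smallest whole-second polling interval that keeps the
    combined request rate of all apps at or below the audit log read limit.

    Assumption (not from the Striim documentation): every app polls at the
    same interval and issues `requests_per_poll` read requests per poll.
    """
    if app_count < 1 or requests_per_minute_limit < 1 or requests_per_poll < 1:
        raise ValueError("all arguments must be positive")
    # Combined requests per minute = app_count * requests_per_poll * (60 / interval).
    # Solving "combined rate <= limit" for the interval gives the bound below.
    interval = 60 * app_count * requests_per_poll / requests_per_minute_limit
    return max(1, math.ceil(interval))


for apps in (1, 5, 12):
    print(apps, "apps ->", min_polling_interval_seconds(apps), "seconds")
```

The value returned is only a lower bound under these assumptions. In practice you would likely choose a longer interval to leave headroom for anything else that reads the audit log in the same project.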
Troubleshooting
This topic describes errors you may see when using the GCS Reader, and possible resolutions.
Exception | Resolutions |
---|---|
GoogleCloudBucketNotFoundException | Check if the specified bucket is present. When using Private Service Connect, verify: |
GoogleCloudCredentialsException | |
GoogleCloudLocalFileSystemException | |
CloudStorageConnectionException | |
GoogleCloudPermissionException | Ensure that the user has the following permissions on the bucket/audit log: |