ADLS Reader

Note

ADLS Reader and ADLS Gen2 Writer read from and write to the same Azure Data Lake Storage (ADLS) service. Microsoft dropped "Gen2" from the name of the service after it retired the ADLS Gen1 service in February 2024.

ADLS Reader is a Striim source adapter that reads objects from a container in an ADLS Gen2 storage account with hierarchical namespace enabled. Azure Data Lake Storage (ADLS) is a scalable data store offered by Microsoft Azure. Data is stored as blobs, which can be accessed from many other Azure components.

Summary

APIs used/data supported

ADLS Reader ships with the following Java client libraries, which are published and maintained by Microsoft Azure:

  • azure-storage-file-datalake, version 12.18.2

  • azure-identity, version 1.11.2

  • azure-monitor-query, version 1.2.9

These libraries allow ADLS Reader to support both new object detection modes and to authenticate through Microsoft Entra ID (formerly Azure Active Directory).

Supported parsers

AAL (Apache access log), Avro, Binary, DSV, Free Form Text, JSON, NVP (name-value pair), Parquet, XML

Supported targets

All targets supported by Striim. See Writers overview.

Security and authentication

Authentication and authorization for ADLS Reader are set up using Microsoft Entra ID (formerly Azure Active Directory). This requires you to create a Microsoft Entra ID application with the appropriate roles or privileges. The Entra ID application must be authorized to access the storage account and, depending on the Object Detection Mode, the Log Analytics workspace as well.

ADLS Reader uses the details of the Microsoft Entra ID application, such as Client ID, Tenant ID, and Client Secret, to connect.

Encryption support: ADLS Reader handles all three types of server-side encryption provided by ADLS Gen2 storage. They are classified by key type: Microsoft-managed keys, customer-managed keys, and customer-provided keys. If a customer-provided key is used for server-side encryption, the same key must also be configured in ADLS Reader.

Operations / modes supported

ADLS Reader processes incremental changes in the container (object creations and updates) according to the configured new object detection mode. The modes are:

  • ADLS Directory Listing - ADLS Reader identifies the changes in each directory by comparing the metadata of the objects in two successive polls.

  • Log Analytics - ADLS Reader identifies the changes in the entire container based on the log entries over the polling duration using Microsoft Azure Log Analytics. This mode involves less computational and network overhead than ADLS Directory Listing. Additional steps must be performed on the storage account before ADLS Reader can use this object detection mode.

ADLS Directory Listing is the default new object detection mode.
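
The comparison performed in ADLS Directory Listing mode can be pictured with a minimal Python sketch. It is an illustration only, not Striim's implementation: the `ObjectMeta` class, the `detect_changes` function, and the choice of last-modified time and size as the compared metadata are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ObjectMeta:
    """Hypothetical metadata captured for one object during a poll."""
    last_modified: float  # epoch seconds (assumed field)
    size: int             # bytes (assumed field)


def detect_changes(previous, current):
    """Compare two successive polls of one directory.

    Both arguments map object name -> ObjectMeta.
    Returns (created, updated) as sorted lists of object names.
    """
    created = sorted(name for name in current if name not in previous)
    updated = sorted(
        name for name in current
        if name in previous and current[name] != previous[name]
    )
    return created, updated


if __name__ == "__main__":
    poll_1 = {"a.csv": ObjectMeta(100.0, 10), "b.csv": ObjectMeta(100.0, 20)}
    poll_2 = {
        "a.csv": ObjectMeta(100.0, 10),  # unchanged
        "b.csv": ObjectMeta(160.0, 25),  # rewritten since the previous poll
        "c.csv": ObjectMeta(150.0, 5),   # new object
    }
    print(detect_changes(poll_1, poll_2))
```

Because every directory has to be listed and compared on each poll, the cost of this approach grows with the number of objects, which is consistent with the lower overhead described for the Log Analytics mode.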

Object processing modes

ADLS Reader processes objects after initiating a byte stream from the ADLS container. With this streaming approach, the entire object does not need to be downloaded to the Striim server before processing starts.

However, not all objects can be processed using streaming. The entire contents of a Parquet file must be available before the Parquet parser can start processing. Therefore, downloading is used instead of streaming when the file parser is Parquet. ADLS Reader throttles the download of objects by limiting the disk space used to 2048 MB and the number of files to 10.
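
One way to picture this kind of throttling is an admission check that is consulted before each download. The sketch below is a hypothetical illustration, not Striim's code: the `DownloadBudget` class and its methods are invented for the example, and it assumes 1 MB = 1024 × 1024 bytes and that space is released once the parser has finished with a file.

```python
MAX_DISK_BYTES = 2048 * 1024 * 1024  # 2048 MB limit from the text (MB size assumed)
MAX_FILES = 10                        # file limit from the text


class DownloadBudget:
    """Tracks locally staged files and decides whether another may be downloaded."""

    def __init__(self, max_bytes=MAX_DISK_BYTES, max_files=MAX_FILES):
        self.max_bytes = max_bytes
        self.max_files = max_files
        self.staged = {}  # object name -> size in bytes

    def used_bytes(self):
        """Total size of all files currently staged on disk."""
        return sum(self.staged.values())

    def try_admit(self, name, size):
        """Reserve space for a download; return False if it has to wait."""
        if len(self.staged) >= self.max_files:
            return False  # file-count limit reached
        if self.used_bytes() + size > self.max_bytes:
            return False  # disk-size limit would be exceeded
        self.staged[name] = size
        return True

    def release(self, name):
        """Free the reservation once the staged file is no longer needed."""
        self.staged.pop(name, None)
```

A caller would retry `try_admit` for a waiting object after each `release`. How the reader treats a single object larger than the disk limit is not stated in the text; in this sketch such an object would never be admitted.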

Resilience / recovery

  • Supports recovery with at-least-once processing (A1P) by recording checkpoints of the names and offset information of processed files. Upon restart, the reader uses the details from the last checkpoint to resume reading from the ADLS container. In some cases, data may have been read after the checkpoint was taken, in which case there may be duplicates in the target (see Recovering applications).

  • Automatically retries based on the Connection Retry Policy settings. Any API call to the ADLS container is retried on a connection failure according to the Connection Retry Policy property (see ADLS Reader properties). If the adapter is still unable to connect to the ADLS container after the configured retries are exhausted, the application halts with an appropriate error message.
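
The reason at-least-once recovery can produce duplicates is easier to see in a small simulation. The following Python sketch is illustrative only and does not reflect Striim internals: `read_from`, `run`, the `(file name, offset)` checkpoint tuple, and the `checkpoint_every` and `crash_after` parameters are all hypothetical.

```python
def read_from(files, checkpoint):
    """Yield (name, offset, record) starting at the checkpointed position.

    `files` is an ordered list of (name, records) pairs; `checkpoint` is the
    (name, offset) of the next record to read.
    """
    names = [name for name, _ in files]
    start_name, start_offset = checkpoint
    start_index = names.index(start_name)
    for name, records in files[start_index:]:
        first = start_offset if name == start_name else 0
        for offset in range(first, len(records)):
            yield name, offset, records[offset]


def run(files, checkpoint, target, checkpoint_every, crash_after=None):
    """Deliver records to `target` and return the last checkpoint recorded."""
    delivered = 0
    for name, offset, record in read_from(files, checkpoint):
        if crash_after is not None and delivered == crash_after:
            return checkpoint  # simulated crash: later progress is not recorded
        target.append(record)
        delivered += 1
        if delivered % checkpoint_every == 0:
            checkpoint = (name, offset + 1)
    return checkpoint


if __name__ == "__main__":
    files = [("f1", ["r1", "r2", "r3"]), ("f2", ["r4", "r5"])]
    target = []
    cp = run(files, ("f1", 0), target, checkpoint_every=2, crash_after=3)
    cp = run(files, cp, target, checkpoint_every=2)  # restart from checkpoint
    print(target)
```

In this run the checkpoint is taken after `r2`, the simulated crash happens after `r3` has already been delivered, and the restart resumes from the checkpoint, so `r3` reaches the target twice while no record is lost.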

Programmability

  • Flow Designer

  • TQL

  • Wizards in the web UI to build pipelines to targets such as databases or apps

Metrics and auditing

Key metrics are available through Striim monitoring. See ADLS Reader monitoring metrics.

Typical use case and integration

A typical use case is using ADLS Reader to read from an ADLS Gen2 storage account and capture events related to file uploads and deletes in Azure Data Lake Storage using a Log Analytics workspace. In this scenario, the storage account is configured to allow ADLS Reader to use the Log Analytics object detection mode, which lets the reader identify the changes in the container based on log entries over the polling duration using Microsoft Azure Log Analytics.

This use case involves setting up monitoring from an Azure Data Lake Storage resource to a Log Analytics workspace. It includes setting up an application using Microsoft Entra ID (formerly Azure Active Directory) and setting up authentication so that the Logs Query Client can read data from the tables in the Log Analytics resource.

ADLS Reader overview

ADLS Reader connects to and reads from an ADLS Gen2 storage account. The ADLS Gen2 storage account is listed as a "storage account" in Azure Marketplace. An ADLS Gen2 account is a storage account whose hierarchical namespace is enabled.

ADLS Gen2 converges the capabilities of ADLS Gen1 with Azure Blob Storage. ADLS Gen2 provides file system semantics, file-level security, and scale along with low-cost, tiered storage and high availability/disaster recovery capabilities.

ADLS Gen2 offers storage for both structured and unstructured data using Storage Tables and Storage Containers respectively. Containers can store large data as files or as binary large objects (blobs). The files within a container are stored in a Hadoop-compatible file system, inside directories that follow a hierarchical structure.

When the application is started for the first time, ADLS Reader processes all the objects in the configured container in increasing order of modified time. After the existing files in the container are processed, the reader picks up newer file creations and updates by periodically polling the container.
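
The ordering described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the adapter's actual logic: `processing_order` and `pending_objects` are hypothetical helpers, a listing is modeled as a plain mapping of object name to modified time, and breaking ties by object name is an arbitrary choice made for the example.

```python
def processing_order(listing):
    """Return object names sorted by ascending modified time.

    `listing` maps object name -> modified time. Ties are broken by name,
    which is an assumption of this sketch.
    """
    return [
        name
        for name, _ in sorted(listing.items(), key=lambda item: (item[1], item[0]))
    ]


def pending_objects(listing, processed):
    """Return objects that are new or changed relative to `processed`, oldest first.

    `processed` maps object name -> modified time at the moment it was read.
    On the first start it is empty, so every object in the container is returned.
    """
    changed = {
        name: mtime
        for name, mtime in listing.items()
        if processed.get(name) != mtime
    }
    return processing_order(changed)
```

On the first start, `pending_objects(listing, {})` returns every object, oldest first. On later polls, only objects that were created or modified since they were last read are returned.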