Striim Cloud 4.1.0 documentation

Delta Lake Writer for Azure Databricks

Delta Lake is an open-source tabular storage format best known through Databricks' implementation. It includes a transaction log that supports features such as ACID transactions and optimistic concurrency control typically associated with relational databases. For more information, see delta.io and Diving Into Delta Lake: Unpacking The Transaction Log.

Delta Lake Writer writes to tables in Azure Databricks. For more information, see Azure Databricks on databricks.com and the Azure Databricks documentation on microsoft.com. In this release, Delta Lake Writer supports only Azure Databricks and no other implementation of Delta Lake.

Writing to Databricks requires a staging area that uses the Databricks File System (DBFS). DBFS has a 2 GB cap on storage, which can cause file corruption. To work around that limitation, we strongly recommend mounting an Azure Data Lake Storage (ADLS) Gen1 or Gen2 object store on DBFS (see Mount Azure Data Lake Storage Gen1 resource using a service principal and OAuth 2.0 or Mount ADLS Gen2 storage).
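
For example, assuming an ADLS Gen2 container has been mounted at the hypothetical DBFS path /mnt/striimstage, the writer's Stage Location property (see Delta Lake Writer properties below) can point to that mount rather than to native DBFS storage. A minimal sketch, using the same placeholder connection values as the samples at the end of this page:

CREATE TARGET DatabricksMountedStage USING DeltaLakeWriter (
  personalAccessToken: '*************************',
  hostname: 'adb-xxxx.xx.azuredatabricks.net',
  connectionUrl: 'jdbc:xxx.xx;transportMode=http;ssl=1;httpPath=xxx;AuthMech=3;UID=token;',
  tables: 'mydb.employee,mydatabase.employee',
  stageLocation: '/mnt/striimstage/'
)
INPUT FROM ns1.sourceStream;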

If you will use MERGE mode, we strongly recommend partitioning your target tables as this will significantly improve performance (see Partitions in the Azure Databricks documentation).

Data is written in batch mode (see Table batch reads and writes). Streaming mode is not supported in this release because it is not supported by Databricks Connect (see Databricks Connect - Limitations).

Known issue DEV-29579: in this release, Delta Lake Writer cannot be used when Striim is running on Microsoft Windows.

Azure Databricks setup

  1. If you already have an Azure Databricks workspace, skip this step.

    To create a workspace, find "Azure Databricks by Microsoft" in the Azure Marketplace, select it, click Get It Now, and follow the prompts. You may select either the Standard or Premium pricing tier. Once deployment is complete, click Go to resource.

  2. On the Databricks workspace Overview page, click Launch Workspace.

  3. In the left navigation pane, click Compute, then click Create Cluster.

  4. Give the cluster any name you like. For Databricks runtime version select 9.1 LTS (this is the only version currently certified by Striim). Optionally, change any other options, then click Create Cluster. Once Databricks has created the cluster it will start automatically.

Creating a Databricks target using a template

When you create a Delta Lake Writer target using a wizard template (see Creating apps using templates), you must specify three properties: Connection URL, Hostname, and Personal Access Token. The Tables property value will be set based on your selections in the wizard.

Databricks does not have schemas. When the source database uses schemas, the tables will be mapped as <source_database>.<source_schema>.<table>,<target_database>.<table>, for example, mydb.myschema.%,mydb.%. Each schema in the source will be mapped to a database in the target. If the databases do not exist in the target, Striim will create them.

Delta Lake Writer properties

When creating a Delta Lake Writer target in TQL, you must specify values for the Connection URL, Hostname, Personal Access Token, and Tables properties. If not specified, the other properties will use their default values.
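
For instance, a minimal sketch that sets only those four required properties, using the same placeholder values as the samples later on this page (all other properties keep their defaults):

CREATE TARGET DatabricksMinimal USING DeltaLakeWriter (
  connectionUrl: 'jdbc:xxx.xx;transportMode=http;ssl=1;httpPath=xxx;AuthMech=3;UID=token;',
  hostname: 'adb-xxxx.xx.azuredatabricks.net',
  personalAccessToken: '*************************',
  tables: 'mydb.employee,mydatabase.employee'
)
INPUT FROM ns1.sourceStream;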

Each property is listed below with its type and default value (if any), followed by notes.

CDDL Action (String, default: Process)

See Handling schema evolution.

Connection Retry Policy (String, default: initialRetryDelay=10s, retryDelayMultiplier=2, maxRetryDelay=1m, maxAttempts=5, totalTimeout=10m)

Do not change unless instructed to by Striim support.

Connection URL (String)

The JDBC URL from the JDBC/ODBC tab of the Databricks cluster's Advanced options (see Get connection details for a cluster).

Hostname (String)

The Server Hostname from the JDBC/ODBC tab of the Databricks cluster's Advanced options (see Get connection details for a cluster).

Ignorable Exception Code (String)

Set to TABLE_NOT_FOUND to prevent the application from crashing when Striim tries to write to a table that does not exist in the target. See Handling "table not found" errors for more information.

Mode (enum, default: AppendOnly)

With the default value AppendOnly:

  • Updates and deletes from DatabaseReader, IncrementalBatchReader, and SQL CDC sources are handled as inserts in the target.

  • Primary key updates result in two records in the target, one with the previous value and one with the new value. If the Tables setting has a ColumnMap that includes @METADATA(OperationName), the operation name for the first event will be DELETE and for the second INSERT.

Set to Merge to handle updates and deletes as updates and deletes instead. In Merge mode:

  • Since Delta Lake does not have primary keys, you may include the keycolumns option in the Tables property to specify a column in the target table that will contain a unique identifier for each row: for example, Tables:'SCOTT.EMP,mydatabase.employee keycolumns(emp_num)'.

  • You may use wildcards for the source table provided all the tables have the key columns. For example, Tables:'DEMO.%,mydatabase.% KeyColumns(...)'.

  • If you do not specify keycolumns, Striim will use the source table's keycolumns as a unique identifier. If the source table has no keycolumns, Striim will concatenate all column values and use that as a unique identifier. For a complete Merge-mode example with a keycolumns clause, see the last sample under Sample TQL application using Delta Lake Writer below.

Optimized Merge (Boolean, default: false)

In Flow Designer, this property will be displayed only when Mode is Merge.

Set to True only when Mode is MERGE and the target's input stream is the output of an HP NonStop reader, MySQL Reader, or Oracle Reader source and the source events will include partial records. For example, with Oracle Reader, when supplemental logging has not been enabled for all columns, partial records are sent for updates. When the source events will always include full records, leave this set to false.

Parallel Threads (Integer)

Not supported when Mode is Merge.

See Creating multiple writer instances.

Personal Access Token (encrypted password)

Used to authenticate with the Databricks cluster (see Generate a personal access token). The user associated with the token must have read and write access to DBFS (see Important information about DBFS permissions). If table access control has been enabled, the user must also have MODIFY and READ_METADATA (see Data object privileges - Data governance model).

Stage Location (String, default: /)

The path to the staging area in DBFS, for example, /StriimStage/.

Tables (String)

The name(s) of the table(s) to write to. The table(s) must exist in the database.

Specify target table names in uppercase as <database>.<table>.

When the target's input stream is a user-defined event, specify a single table.

The only special character allowed in target table names is underscore (_).

When the input stream of the writer is the output of a DatabaseReader, IncrementalBatchReader, or SQL CDC source, it can write to multiple tables. In this case, specify the names of both the source and target tables. You may use the % wildcard for the table names, but not for the schema or database. If the reader uses three-part names, you must use them here as well. SQL Server source table names must be specified in three parts when the source is Database Reader or Incremental Batch Reader (database.schema.%,database.%), but in two parts when the source is MS SQL Reader or MS Jet (schema.%,database.%). Examples:

source.emp,target_database.emp
source_schema.%,target_database.%
source_database.source_schema.%,target_database.%

MySQL and Oracle names are case-sensitive; SQL Server names are not. Specify names as <schema name>.<table name> for MySQL and Oracle and as <database name>.<schema name>.<table name> for SQL Server.

See Mapping columns for additional options.

Upload Policy (String, default: eventcount:10000, interval:60s)

The upload policy may include eventcount and/or interval (see Setting output names and rollover / upload policies for syntax). Buffered data is written to the staging area every time any of the specified values is exceeded. With the default value, data will be written every 60 seconds, or sooner if the buffer contains 10,000 events. When the app is undeployed, all data remaining in the buffer is written to the staging area.

Sample TQL application using Delta Lake Writer

Sample TQL in AppendOnly mode:

CREATE TARGET DatabricksAppendOnly USING DeltaLakeWriter (
  personalAccessToken: '*************************',
  hostname: 'adb-xxxx.xx.azuredatabricks.net',
  tables: 'mydb.employee,mydatabase.employee',
  connectionUrl: 'jdbc:xxx.xx;transportMode=http;ssl=1;httpPath=xxx;AuthMech=3;UID=token;',
  stageLocation: '/StriimStage/'
)
INPUT FROM ns1.sourceStream;

Sample TQL in Merge mode with Optimized Merge set to True:

CREATE TARGET DatabricksMerge USING DeltaLakeWriter (
  personalAccessToken: '*************************',
  hostname: 'adb-xxxx.xx.azuredatabricks.net',
  tables: 'mydb.employee,mydatabase.employee',
  connectionUrl: 'jdbc:xxx.xx;transportMode=http;ssl=1;httpPath=xxx;AuthMech=3;UID=token;',
  stageLocation: '/StriimStage/',
  mode: 'MERGE',
  optimizedMerge: 'true'
)
INPUT FROM ns1.sourceStream;
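
A third sketch shows Merge mode with an explicit keycolumns clause in the Tables property, using the syntax described in the notes for the Mode property (the table and column names are the illustrative ones from those notes):

CREATE TARGET DatabricksMergeKeyColumns USING DeltaLakeWriter (
  personalAccessToken: '*************************',
  hostname: 'adb-xxxx.xx.azuredatabricks.net',
  tables: 'SCOTT.EMP,mydatabase.employee keycolumns(emp_num)',
  connectionUrl: 'jdbc:xxx.xx;transportMode=http;ssl=1;httpPath=xxx;AuthMech=3;UID=token;',
  stageLocation: '/StriimStage/',
  mode: 'MERGE'
)
INPUT FROM ns1.sourceStream;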

Delta Lake Writer data type support and mapping

TQL type                   Delta Lake type

java.lang.Byte             binary
java.lang.Double           double
java.lang.Float            float
java.lang.Integer          int
java.lang.Long             bigint
java.lang.Short            smallint
java.lang.String           string
org.joda.time.DateTime     timestamp

For additional data type mappings, see Data type support & mapping for schema conversion & evolution.