Striim is excited to introduce our fully managed and purpose-driven service for Databricks. In this demo, you will see how simple it is to configure a data pipeline from Oracle to Databricks. You will be able to set up a pipeline in under 5 minutes and watch the data arrive in Databricks in real time.
Striim for Databricks is the first fully managed and purpose-built streaming service for Databricks in the industry. It is designed for everyone: you do not need any prior ETL expertise, and the simplified user experience requires little to no coding. This solution also offers reduced total cost of ownership with consumption-based metering and billing.
In this demo, we will create an Oracle to Databricks pipeline and move your data in a few simple steps. In addition to this video, we have added inline documentation to help you understand the on-screen information.
When you launch the service, you will be brought to the Create a Pipeline screen. On this screen, you will enter a Data Pipeline Name.
Then, you will connect to Databricks. Connection details for Databricks are saved so you can reuse them for future pipeline configurations. In this demo, we will create a new connection to Databricks that requires the account keys to be entered. The service automatically validates the connection and checks for all the necessary prerequisites.
Introducing Striim for Databricks
As with the target, the service will save the source connection details for future use. Prerequisite checks are run against the source as well, and the report will be shown to you.
If the connection is valid, the service identifies the schemas on the Oracle source and presents the list for you to select the correct one.
The service then checks the compatibility of the source schema with Databricks and presents the table list for your selection. While selecting your tables, you can also choose the transformation per table that will be applied as the data flows through the pipeline in real time. For this demo, let’s choose to mask this specific column’s data.
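To give a rough idea of what masking a column means for the data that lands in Databricks, here is a minimal Python sketch. It is not how the service implements masking. The function names `mask_value` and `mask_column`, the keep-the-last-four-characters rule, and the `x` mask character are all assumptions made for illustration.

```python
def mask_value(value, visible=4, mask_char="x"):
    """Hide all but the last `visible` characters of a string.

    Hypothetical rule for illustration only: values no longer than
    `visible` characters are masked completely.
    """
    if visible <= 0 or len(value) <= visible:
        return mask_char * len(value)
    hidden = len(value) - visible
    return mask_char * hidden + value[-visible:]


def mask_column(rows, column):
    """Return copies of the row dicts with one column masked."""
    return [{**row, column: mask_value(str(row[column]))} for row in rows]
```

For example, `mask_value("1234567890")` keeps only the last four characters readable, and `mask_column` applies the same rule to one column of every row while leaving the other columns untouched.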
Striim for Databricks also offers intelligent performance optimization with parallel data processing by grouping the tables.
A summary is shown in case you choose to make modifications before running the pipeline. In this demo, we reviewed it and started the first pipeline. As you can see, within a few seconds the pipeline was created and started to move the initial load automatically.
Striim for Databricks also has intuitive overview dashboards and monitoring screens. The source and target statuses are displayed here on the Overview screen. In this case, Oracle is online and green, and Databricks is Paused, which means the data is not flowing yet between the source and target. We will review our Oracle data and Databricks to make sure the data will flow smoothly.
Let’s check what’s happening in our source and target. First, we will go into Oracle to check the number of records that have been moved through change data capture (CDC) for each table. Then we will check in Databricks that the same number of records have been updated for each table. We can also review the tables and columns that we have masked to ensure they were processed correctly.
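The per-table comparison described here can also be scripted. The sketch below assumes you have already collected a record count for each table on both sides, for example with a `SELECT COUNT(*)` query against Oracle and against Databricks. The function name `compare_counts` and the dictionary layout are assumptions for this example, not part of the service.

```python
def compare_counts(source_counts, target_counts):
    """Return {table: (source_count, target_count)} for tables that differ.

    A table that is missing on one side is reported with None for that side.
    """
    mismatches = {}
    for table in sorted(set(source_counts) | set(target_counts)):
        src = source_counts.get(table)
        tgt = target_counts.get(table)
        if src != tgt:
            mismatches[table] = (src, tgt)
    return mismatches
```

An empty result means every table has the same number of records in the source and the target.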
We will also use the Manage Tables in Pipeline feature to remove any tables that we no longer want to stream.
Next, we will use the Optimize Pipeline Performance screen, which will show us which tables in the pipeline may be causing issues in the data stream. We can then pause the pipeline to optimize performance by creating table groupings and reducing the time spent between batches being sent to Databricks.
Now we will go back to the Oracle database and insert values into the source table. As you can see, the pipeline immediately recognizes the changes on Oracle and starts capturing the changes in real time. If we run a query on Databricks now to check the changes made to that table, we will see the CDC events are already available.
Thanks for watching! You have now seen seamless, automated, real-time data capture using the Striim for Databricks service.
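One way to picture table grouping is as spreading the busiest tables across groups so that no single group carries most of the load. The sketch below uses a simple greedy balancing heuristic. This is a technique we are swapping in for illustration: the demo does not say how the service forms its groups. The function name `group_tables` and the per-table volume figures are hypothetical.

```python
import heapq


def group_tables(table_volumes, group_count):
    """Greedily assign tables to groups, largest volume first.

    Each table goes to whichever group currently has the smallest total
    volume. This is an illustrative heuristic, not the service's algorithm.
    """
    if group_count < 1:
        raise ValueError("group_count must be at least 1")
    groups = [[] for _ in range(group_count)]
    # Heap of (current total volume, group index); ties go to the lower index.
    heap = [(0, index) for index in range(group_count)]
    ordered = sorted(table_volumes.items(), key=lambda item: (-item[1], item[0]))
    for table, volume in ordered:
        load, index = heapq.heappop(heap)
        groups[index].append(table)
        heapq.heappush(heap, (load + volume, index))
    return groups
```

With the made-up volumes `{"ORDERS": 100, "ITEMS": 60, "USERS": 50, "LOGS": 10}` and two groups, the heavy `ORDERS` table ends up sharing a group only with the small `LOGS` table, while `ITEMS` and `USERS` share the other.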