Striim’s article “Real-Time Data Is for Much More Than Just Analytics” was originally published on Forbes.
The conversation around real-time data, fast data and streaming data is getting louder and more energetic. As the age of big data fades into the sunset — and many industry folks are even reluctant to use the term — there is much more focus on fast data and obtaining timely insights. The focus of many of these discussions is on real-time analytics (otherwise known as streaming analytics), but this only scratches the surface of what real-time data can be used for.
If you look at how real-time data pipelines are actually being utilized, you find that about 75% of the use cases are integration-related. That is, continuous data collection creates real-time data streams, which are processed, enriched and then delivered to other systems. Often these other systems are not themselves streaming. The target could be a database, data warehouse or cloud storage, with the goal of ensuring that these systems are always up to date. This leaves only about 25% of companies doing immediate streaming analytics on real-time data, yet those are the use cases getting much more attention.
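To make the integration pattern concrete, here is a minimal, purely illustrative Python sketch: events are collected continuously from a source, lightly enriched, and delivered to a non-streaming target. The event structure, source and target are hypothetical stand-ins, not any particular product’s API.

```python
# Illustrative streaming integration loop: collect -> enrich -> deliver.
# All names and structures are invented for this sketch.
import time
from datetime import datetime, timezone

def change_events():
    """Stand-in for a continuous source such as a CDC feed or message queue."""
    sample = [
        {"op": "INSERT", "table": "orders", "row": {"order_id": 1, "amount": 42.0}},
        {"op": "UPDATE", "table": "orders", "row": {"order_id": 1, "amount": 44.5}},
    ]
    for event in sample:
        yield event
        time.sleep(0.1)  # simulate events arriving over time

def enrich(event):
    """Add context the downstream target will want, e.g. a processing timestamp."""
    event["processed_at"] = datetime.now(timezone.utc).isoformat()
    return event

def deliver(event, target):
    """Stand-in for writing to a database, data warehouse or cloud storage."""
    target.append(event)

warehouse = []  # placeholder for the non-streaming target
for evt in change_events():
    deliver(enrich(evt), warehouse)

print(f"Delivered {len(warehouse)} events; latest: {warehouse[-1]}")
```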
There are many reasons why streaming data integration is more common, but the main reason is quite simple: streaming is a relatively new technology, and you cannot do streaming analytics without first sourcing real-time data. This is known as a “streaming-first” data architecture, where the first problem to solve is obtaining real-time data feeds.
Organizations can be quite pragmatic about this and approach stream-enabling their sources on a need-to-have, use-case-specific basis. This could be because batch ETL systems no longer scale, or because batch windows have disappeared in a 24/7 enterprise. Or they may want to move to more modern technologies best suited to the task at hand, and keep them continually up to date as part of a digital transformation initiative.
Cloud Is Driving Streaming Data Integration
The rise of cloud has made a streaming-first approach to data integration much more attractive. Simple use cases, like migrating an on-premises database that serves an in-house business application to the cloud, are often not even viable without streaming data integration.
The naive approach would be to back up the database, load it into the cloud and point the cloud application at it. However, this assumes a few things:
1. You can afford application downtime.
2. Your application can be stopped while you are doing this.
3. You can spin up and use the cloud application without testing it.
For most business-critical applications, none of these things are true.
A better approach to minimizing or eliminating downtime is an online migration that keeps the application running. To perform this task, capture changes from the in-house database as real-time data streams using a technology called change data capture (CDC), load the database to the cloud, then apply any changes from the real-time stream that occurred while the load was running. The change delivery to the cloud can be kept running while you test the cloud application, and when you cut over, it will already be up to date.
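As a rough illustration of that sequence (not Striim’s implementation; the source and target are reduced to in-memory dictionaries), the sketch below walks through the three steps: snapshot load, capture of changes made while the load runs, and applying those changes so the cloud copy is already current at cutover.

```python
# Illustrative online-migration sketch. A real system would use CDC readers
# and database writers; here everything is simulated with plain dicts.

source_db = {1: {"name": "alice", "balance": 100},
             2: {"name": "bob", "balance": 250}}

# Changes that arrive after the snapshot begins are captured as a change log.
captured_changes = []

def capture_change(key, row):
    """Stand-in for change data capture on the source database."""
    source_db[key] = row
    captured_changes.append(("UPSERT", key, row))

# Step 1: initial load (a point-in-time snapshot of the source).
cloud_db = {k: dict(v) for k, v in source_db.items()}

# Step 2: the application keeps running, so writes continue during the load.
capture_change(2, {"name": "bob", "balance": 300})
capture_change(3, {"name": "carol", "balance": 75})

# Step 3: apply the captured changes to the cloud copy until it is caught up.
for op, key, row in captured_changes:
    if op == "UPSERT":
        cloud_db[key] = row

assert cloud_db == source_db  # target is in sync; safe to test and cut over
print("Cloud copy is up to date:", cloud_db)
```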
Streaming data integration is a crucial element of this type of use case, and it can also be applied to cloud bursting, operational machine learning, large-scale cloud analytics or any other scenario where having up-to-the-second data is essential.
Streaming Data Integration Is The Precursor To Streaming Analytics
Once organizations are doing real-time data collection, typically for integration purposes, it then opens the door to doing streaming analytics. But you can’t put the cart before the horse and do streaming analytics unless you already have streaming data.
Streaming analytics also requires prepared data. It is commonly cited that 80% of the time spent in data science goes into data preparation. This is true for machine learning and equally true for streaming analytics. Obtaining the real-time data feed is just the beginning. You may also need to transform, join, cleanse and enrich data streams to give the data more context before performing analytics.
As a simple example, imagine you are performing CDC on a source database and have a stream of orders being made by customers. In any well-normalized relational database, the rows in these order tables are mostly just IDs referring to detail contained in other tables.
This might be perfect for a relational, transactional system, but it’s not very useful for analytics. However, if you can join the streaming data with reference data for customers and items, you have now added more context and more value. The analytics can now show real-time sales by customer location or item category and truly provide business insights.
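Here is a small, hypothetical Python sketch of that enrichment step: a CDC stream of orders containing only foreign keys is joined with customer and item reference data so each event carries the context the analytics needs. All table and field names are invented for illustration.

```python
# Enriching a CDC stream of orders with reference data before analytics.

customers = {101: {"name": "Acme Corp", "city": "Chicago"},
             102: {"name": "Globex",    "city": "Austin"}}
items = {5001: {"description": "Widget", "category": "Hardware"},
         5002: {"description": "Gadget", "category": "Electronics"}}

# Raw change events are mostly foreign keys, as in a normalized schema.
order_stream = [
    {"order_id": 1, "customer_id": 101, "item_id": 5001, "qty": 3},
    {"order_id": 2, "customer_id": 102, "item_id": 5002, "qty": 1},
]

def enrich_order(order):
    """Join the event with customer and item reference data for context."""
    customer = customers.get(order["customer_id"], {})
    item = items.get(order["item_id"], {})
    return {**order,
            "customer_city": customer.get("city"),
            "item_category": item.get("category")}

for event in order_stream:
    enriched = enrich_order(event)
    # Enriched events can now drive analytics such as sales by city or category.
    print(enriched)
```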
Without the processing steps of streaming data integration, the streaming analytics would lose value, again showing how important the real-time integration layer really is.
Busting The Myth That Real-Time Data Is Prohibitively Expensive
A final consideration is cost. It has been said repeatedly that real-time systems are expensive and should only be used when absolutely necessary, with algorithmic trading and critical control systems as the typically cited use cases.
While this may have been true in the past, the massive improvements in the price-performance equation for CPU and memory over the last few decades have made real-time systems, and in-memory processing in general, affordable for mass consumption. Coupled with cloud deployments and containerization, the capability to have real-time data streamed to any system is within reach of any enterprise.
While real-time analytics and instant operational insights may get the most publicity and represent the long-term goal of many organizations, the real workhorse behind the scenes is streaming data integration.