Striim CTO and Co-Founder, Steve Wilkes, discusses customer use cases where real-time data integration with built-in stream processing solves key challenges related to today’s fast-paced, heterogeneous and hybrid data environments.
Unedited Transcript:
Welcome and thank you for joining us for today’s webinar. My name is Katherine and I will be your moderator. The presentation today is entitled “How Industry Leaders Are Using Streaming Integration to Modernize Their Data Architecture.” Our presenter is Steve Wilkes, co-founder and CTO of Striim. Throughout the event, please feel free to submit your questions in the Q and A panel located on the right hand side of your screen. With that, it is my pleasure to introduce Steve Wilkes.
Good morning everyone, or afternoon, depending on where you are. Today we’re going to go through some of the use cases around streaming integration and why it’s important, but to get us going, we first have to talk about what streaming integration is and give a little bit of information about the Striim platform, which is how we implement streaming integration. So what is streaming integration? Streaming integration is all about continuously moving any enterprise data, handling huge volumes of data, being able to scale while maintaining really high throughput, and also being able to process, analyze, and correlate that data in-flight while it’s moving, especially around tasks that may be involved in integration, such as joining disparate data sources together. And this is all to make your data more valuable, to give you visibility into it, and to do all of this in a verifiable fashion.
Streaming integration really comes from the natural way of dealing with data. Data doesn’t arrive in batches. There’s no data that was created as a batch. Data is continuously being generated because things in the real world are happening. Humans are doing stuff, working on applications, going to websites, and that’s creating data as rows in databases and lines in machine logs. Machines are doing things as well: they’re reacting to things, they’re running applications, they’re writing logs. And then you have devices that generate events. None of these things are created in batches. But batches were used because that was a way of dealing with the technological limitations of the day: storage was cheap, and memory and CPU were expensive. So the natural way of dealing with large amounts of data was to store it and then process it in batches as that data built up.
But storage is not keeping up with the amount of data that is being generated. In fact, depending on who you ask, only around six to ten percent of all the data generated in the world is ever stored. Meanwhile, CPU and memory have gotten much cheaper. So the natural way of dealing with data is through streaming technologies: collecting data as it’s being created and streaming it throughout your enterprise, dealing with things as the data is created and not after the fact, as you would with batches.
It’s not just collecting the data, but being able to process that data, analyze it, visualize it, and get insights from it immediately, and being able to put that data where it makes sense. Technology has evolved. Analytics used to happen in data warehouses: you’d have data in databases, you’d move it into a data warehouse, and that’s where you’d do analytics. But today there are lots of different technologies in place in most organizations, because you need to use the correct technology to ask the right questions of your data. In most cases people aren’t storing data in a data warehouse for machine learning purposes; they’re storing it in Hadoop or in the cloud, in things like Azure Blob Storage, because that is a natural place to store large amounts of data for machine learning.
They may be using graph databases or other technologies to ask the right questions of their data. So being able to integrate with the correct technologies is really important, and getting the data into the right form for each technology is also important. And we believe that using SQL as the language for processing and working with data, whether it’s at rest or whether it’s streaming, as in streaming integration, is the correct way of dealing with data, because it lets people who know data work with it. Importantly, you need to be able to run this SQL continuously on the data as it’s streaming, before it ever hits disk.
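To make that last point concrete, here is a minimal sketch of a continuous query in Striim’s SQL-like syntax. The stream and field names are hypothetical, and the exact syntax should be checked against the Striim documentation; the point is simply that the query runs against each event as it arrives rather than against a table on disk.

```sql
-- Minimal sketch of a continuous query (names are illustrative, not from a real app).
-- It evaluates every event on OrdersStream as it arrives, before anything hits disk.
CREATE CQ FilterLargeOrders
INSERT INTO LargeOrdersStream
SELECT orderId, customerId, amount
FROM OrdersStream          -- an in-memory data stream, not a stored table
WHERE amount > 1000;       -- filter applied in-flight, continuously
```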
Companies are using streaming integration for lots of different purposes. They range from data distribution, moving change data from databases in real time and pushing it onto Kafka or other event distribution mechanisms on-premises or in the cloud; integrating databases together and keeping databases up to date with each other; keeping in-memory data grids consistent with databases; and adopting new technologies as well. People are using streaming integration to migrate data to the cloud, to things like Amazon Redshift, Google BigQuery, and Azure SQL Database, doing this continuously rather than in batches, and also moving data into other targets, not in batches but continuously, keeping them up to date so the data is always ready to ask questions of. Then there are people doing real-time analytics: analyzing streaming data as it’s being created, visualizing it, alerting off it, and getting insights from it immediately. And other people are using streaming integration for IoT: processing data at the edge, doing change detection at the edge, redundancy removal and filtering, and then moving that data into a central place for analytics.
And all of this is because customers are modernizing. As I mentioned, it’s important to choose the correct technology to be able to make decisions. We have customers with legacy systems that are 20 or 30 years old; for other customers, legacy means two years old, because it wasn’t the right technology to ask the right questions of the data. So being able to work with data wherever it’s created and still get the benefits of streaming integration is really important. You can think of streaming integration as a nice shiny new aluminum-and-glass extension built on the side of your existing legacy system. It enables you to access that data in real time, move it to the correct places, and do in-memory real-time analytics on that data to get real-time insights.
So let’s talk a little bit about the Striim platform before we go into some of the real-world use cases. The Striim platform is a full platform for streaming integration, and by going through what the platform comprises, you’ll understand all the pieces that you need for streaming integration. Streaming integration starts with continuous data collection, and the goal here is to turn any enterprise data source into a stream of data. Some things push data at you, such as sensors and message queues, but things like log files and databases aren’t inherently streaming. With log files, you don’t want to treat them as batches. You want technology that can read at the end of those files and stream the new data as soon as it’s written, handling file rollover and those kinds of policies, but inherently streaming the data as it’s being written to the file.
Similarly with databases, you can’t keep running SQL queries against the database or use triggers; most DBAs will frown upon that. But you can use a technology called change data capture, which listens to the transaction logs of the databases and streams out the changes, the inserts, updates, and deletes, as they’re happening in real time. Once you’ve done this continuous data collection, you have in-memory streams of data as it’s being created that can be moved anywhere within the enterprise or cloud and delivered to targets. And it’s a big mix-and-match bag: you can take change data from databases via change data capture and push it into Kafka. You can take things from Kafka and push them into Hadoop. You can take things from log files and push them into the cloud. And that real-time continuous movement enables those targets to be kept up to date with the sources.
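As a rough sketch of those two collection patterns, the fragment below shows a file-tailing source and a change data capture source in TQL-style syntax. The reader names and property keys are approximations of Striim’s adapters and should be treated as assumptions rather than exact configuration.

```sql
-- Sketch: tail a growing log file and stream new lines as they are written.
-- Adapter and property names are assumptions; check the Striim adapter docs.
CREATE SOURCE AppLogSource USING FileReader (
  directory: '/var/log/myapp',
  wildcard: 'app*.log',          -- pick up rolled-over files as well
  positionByEOF: true            -- start at the end and stream only new data
)
PARSE USING FreeFormTextParser ()
OUTPUT TO AppLogStream;

-- Sketch: change data capture from a database's transaction logs,
-- rather than repeatedly querying the tables themselves.
CREATE SOURCE OrdersCDC USING OracleReader (
  Username: 'striim',
  Password: '********',
  ConnectionURL: 'dbhost:1521:ORCL',
  Tables: 'SALES.ORDERS'
)
OUTPUT TO OrdersChangeStream;    -- inserts, updates, and deletes as in-memory events
```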
Pretty often we see Kafka in this mix. So in addition to being able to read from and write to Kafka as a source and a target, we also ship Kafka as part of our product, and its use is optional. Wherever we have a data stream (and you’ll see in a later slide that data flows can be quite complex and involve lots of different data streams), typically you want almost all of those data streams to be in memory. You don’t want to be utilizing a persistent message queue for every data stream, because then you’re writing to disk and slowing things down, and you’re not getting the benefits of true in-memory processing. So we see Kafka mostly as a persistent way of backing initial data collection, which enables you to do things like exactly-once processing and recovery on sources that you can’t ask for the data again.
So Kafka is an important aspect of our product, but we have our own high-speed in-memory messaging system as well, which is used for most of the data streams. On top of this movement, you then start to get into the processing of data, and we do this through SQL-based continuous queries that enable you to do things like filtering, transformation, and aggregation of data. We also include a built-in in-memory data grid, or distributed cache, that enables you to load large amounts of reference data into memory, join it with the streaming data, and enrich the streaming data with that reference data. On top of this, you can do more complex tasks like correlation, anomaly detection, statistical analysis, and complex event processing, all on streaming data, depending on how complex your integration tasks are. We also include the ability to build real-time dashboards and send alerts. And while these dashboards play an essential role in analytics-focused applications, customers also use them to build custom monitoring solutions around integration flows.
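To give a flavor of that processing layer, here is a hedged sketch of a windowed aggregation that could feed an alerting flow. The window, stream, and field names are invented for illustration and the threshold is arbitrary.

```sql
-- Sketch: keep the last five minutes of readings per device in memory.
CREATE WINDOW ReadingsWindow
OVER SensorStream KEEP WITHIN 5 MINUTE
PARTITION BY deviceId;

-- Continuous aggregation over that window; results could drive a dashboard
-- or an alert subscription downstream.
CREATE CQ DetectHotDevices
INSERT INTO HotDeviceStream
SELECT deviceId,
       AVG(temperature) AS avgTemp,
       COUNT(*)         AS readings
FROM ReadingsWindow
GROUP BY deviceId
HAVING AVG(temperature) > 90;   -- illustrative threshold only
```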
The platform can also integrate with things that aren’t on these lists. So if you want to integrate with machine learning or a graph database, for example, it’s pretty easy to integrate our platform with other things and have them available within the SQL-based processing. And we do all of this in an enterprise-grade fashion that is inherently distributed, reliable, scalable, and secure. It’s a bit of an eye chart, but this sums up all of the sources and targets within the Striim platform. Every time I present this I have to update it, because we’re always adding new things that we can work with; very recently, for example, we added Apache Kudu. You can see a full list of our sources and targets on our website.
This is what a data flow looks like. Typically you’re starting at the top: you’re collecting real-time data in a continuous fashion and moving it through data streams with processing in between. All of the processing is done through these SQL-based continuous queries, which enable you to do all of that processing, transformation, and analytics on the data. The visualization is done through dashboards. We have a dashboard builder built into the product: you can drag and drop visualizations, define how you get the data from the backend, and now you have a live real-time dashboard. But you can also go back in time, filter, and search the dashboard, so it’s a very comprehensive visualization solution.
And we believe that our platform is best for streaming integration because it has been designed from the ground up to be scalable, distributed, and in-memory, and to provide all of the enterprise-grade qualities that you’d expect, such as scalability, recovery, failover, a full security model, and exactly-once processing guarantees. The platform was designed to be easy to use and integrates with everything that you have today in your enterprise. That enables you to deliver solutions really quickly and iterate over those solutions, operationalizing your data and giving you value while it’s still relevant.
So that’s a quick overview of the Striim platform. Now we’ll go into what the Striim platform can do; you can piece it together in lots of different ways to give you lots of different solutions, and these are some of the things that customers are doing with Striim in the real world. We’ll start with a leading credit card network that is doing quite a lot with their data. This particular example is about security data, and it starts with being able to collect security data as it’s created from lots of different sources. It could be things like web logs, system logs, VPN information, and firewall data. All of that data is in different forms and written to different places; you need to collect it all and then distribute it somehow. The goal here was initially just to distribute the data over Kafka, but not just the raw data: they wanted to be able to correlate that data, put similar things together, and push that onto Kafka. So they’re collecting data from all these different sources and joining it by things like session ID or IP address within a certain amount of time, so what they push out are records that contain all of the things that happened for a particular IP address within the last second or so.
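That correlation step might look roughly like the sketch below: a short window partitioned by IP address, a grouping query, and a Kafka target. This is not the customer’s actual code; stream names, fields, and adapter properties are assumptions.

```sql
-- Sketch: hold about one second of security events per source IP address.
CREATE WINDOW SecEventWindow
OVER SecurityEventStream KEEP WITHIN 1 SECOND
PARTITION BY ipAddress;

-- Collapse everything that happened for an IP in that second into one record.
CREATE CQ CorrelateByIP
INSERT INTO CorrelatedSecurityStream
SELECT ipAddress,
       COUNT(*)                  AS eventCount,
       COUNT(DISTINCT eventType) AS distinctEventTypes,
       MAX(severity)             AS maxSeverity
FROM SecEventWindow
GROUP BY ipAddress;

-- Publish the correlated records to a Kafka topic (property names assumed).
CREATE TARGET SecurityHubOut USING KafkaWriter (
  brokerAddress: 'kafka1:9092',
  Topic: 'security-correlated'
)
FORMAT USING JSONFormatter ()
INPUT FROM CorrelatedSecurityStream;
```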
That gives them a security data hub on Kafka, and they then need to push that data out somewhere else. So they utilize us for doing further processing of the data and pushing it onto different topics of pre-prepared data on Kafka, but also for writing it into things like databases, some of it into the cloud, and some of it into Hadoop, to do different types of analytics based on what questions they want to ask of the data. They’re also doing some in-memory analytics on that streaming data, based on the correlations, to be able to send alerts. The core of this was that they wanted to optimize the way security analysts work and give them more time to deal with real issues. By correlating the data and analyzing it in real time, they can push out high-priority alerts that need focus immediately, rather than the low-level alarms going off from all their different security products.
This global media and technology company is doing very different things. They’re sourcing data from SQL Server and Oracle that includes customer data, orders, and all of the other things that are going on within the enterprise. What they wanted to do was provide completely up-to-date information for the customer service people based on this customer data, so they could be sure that the information they were looking at was always current. So they are pushing this data from Oracle and SQL Server onto Kafka, but also into cloud applications and hybrid cloud systems that are continuously in sync, always providing their line of business with up-to-date information.
This is an interesting one. This online retailer of health and beauty products downloaded our platform and started building out this application. Initially, in order to get budget to build out the application, they needed to prove that this real-time data had an ROI, so this was used as a way of showing management just how important this real-time data was. Now it is continually providing real-time analytics on website data and on what customers are ordering, and this increases the productivity of lots of different teams, including the QA and performance teams for the website. They can see if the website is having any issues, whether those issues are impacting orders, and what they need to do to fix things so that customers stay happy. It includes real-time dashboards for insight into all this information, and alerts, so they can make instant decisions.
This is a North American financial institution that is also adopting the cloud. To do this they are utilizing Azure, using Azure Event Hubs as a gateway into cloud applications. They’re utilizing streaming integration to move data from on-premises systems; in addition to Oracle, they have HP NonStop, which used to be known as Tandem and is running all of their real-time transactions. Striim takes data from HP NonStop and Oracle using change data capture and delivers it in real time into Azure Event Hubs, so they get real-time data for financial reporting and other applications.
This is a European satellite TV company. They’re using streaming integration for real-time CRM information. They have data in a CRM database and are using change data capture to get real-time insights into what is happening with their customers. Because of rules around customer information (for those of you who haven’t heard of GDPR, it has lots of different rules around the way you deal with customer data), a lot of the analytics they’re doing on this needs to be anonymized. To prevent the analytics systems from ever having customer data, they’re doing data masking to remove a lot of the identifiable information while still retaining the essence of the data, so the analytics still work. So they’re using streaming integration to deliver the changed data from the CRM systems, with masking, into Kafka, where they can then build real-time customer analytics.
Then we have this digital TV company. What this company is doing is actually an OEM use case: they’re using streaming integration and analytics in order to spot potential piracy or breaches of subscription rules around watching video on demand. This is a service that they sell to their customers, the companies that provide video to consumers. So things like a video being downloaded too fast for it to actually be watched may indicate it is being downloaded for piracy, and there are lots of different rules around that. Striim is embedded here for continuous collection, monitoring, and analytics of real-time data.
Then finally we have this devices, software, and support services provider. Again, they’re following a cloud adoption model, moving on-premises data from databases into various Amazon technologies: moving data from SQL Server and Oracle on-premises and delivering it in real time to Amazon Redshift and S3, keeping those completely up to date. That allows the cloud applications, operational reporting, and real-time analytics to always have up-to-the-second customer information for warranty and claim information. So those are a lot of different use cases, using a lot of different aspects of streaming integration. What they have in common is that they are all about adopting new technologies and utilizing them in the appropriate way, while minimizing risk by detecting security threats, ensuring compliance, and preventing fraud and piracy, doing all of that in real time, and enhancing the customer experience by doing things quickly, so that you can get the most value from your data, reduce inefficiencies and costs, and offer new services to customers.
So now let’s get into what this looks like. How do you work with streaming integration? We’ll show Striim and some of the things that you can do when integrating data. We’ll start with simply taking data from a database and pushing it onto Kafka. Here we can use one of our built-in templates, which is basically a wizard for moving data into Kafka; there are lots of different ways you can do this. We’re going to take data from a MySQL database and push it onto Kafka with a little data flow. We’ll name the application and then configure how it connects to the MySQL database: you just enter the connection properties, username, password, etc. Then we’ll check the database and make sure it’s ready for change data capture.
If it’s not, we’ll tell you what to fix, and you’ll be up and running really quickly. You then select the tables that you want; we’re just going to select one table in this case. That’s going to create a data stream, and that data stream is going to be written into a target, which is going to write into Kafka. So you need to configure the properties of how you connect to Kafka and what format you want the data in; in this case we’re going to use JSON. If we save this, it generates a data flow, and the data flow is very simple: we’re doing change data capture from MySQL and writing the changes into Kafka. If we start this up, we need to deploy it first, which turns this description of the application into runtime objects, and then start it.
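The flow the wizard generates corresponds roughly to a small application like the sketch below. Connection details are placeholders and the reader and writer property names are approximations, so treat this as an outline of the generated flow rather than its exact output.

```sql
CREATE APPLICATION MySQLToKafka;

-- Change data capture from one MySQL table (connection details are placeholders).
CREATE SOURCE OrdersCDC USING MysqlReader (
  Username: 'striim',
  Password: '********',
  ConnectionURL: 'mysql://dbhost:3306',
  Tables: 'shop.ORDERS'
)
OUTPUT TO OrdersChangeStream;

-- Deliver every change event to a Kafka topic, formatted as JSON.
CREATE TARGET OrdersToKafka USING KafkaWriter (
  brokerAddress: 'kafka1:9092',
  Topic: 'orders_cdc'
)
FORMAT USING JSONFormatter ()
INPUT FROM OrdersChangeStream;

END APPLICATION MySQLToKafka;
```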
And you can see in the data stream that we have changes going on in the database all the time. You can see the data stream here; this is the raw data, what we’re collecting: the before image and the after image of updates happening within the database, what a row looked like before the update and what it looks like afterwards. Pretty often, though, you don’t want the raw data, so we’re going to modify this data flow to add in some processing. We can do that by adding a new query into the data flow. This is an in-memory continuous query that is going to modify the data in some way. We’ll start off by just getting the fields that we want, taking some of the fields from the before image of the update and the rest from the after image, and you can see the change in the structure of the data there. That’s just converting the data into the form we want. We’ll also change the data stream here, so that instead of the raw data, we write that changed data to Kafka, generating JSON that contains only that data. Now we also want to add additional context information, and we can do that by loading data into memory. We had a product ID in that data, and we’re going to join it with product information that we load from a database table into memory, which can tell you more details about that product. So we specify this in-memory data grid, this cache, that we’re going to use.
We select that data from a database table; when we deploy this application, it will load that data into memory, and it is then usable within a query. So we’ll modify the query to also pull the description, brand, category, etc. from the cache, joining on product ID, and then you can see that all of that information is being collected and included in the data stream going out to Kafka.
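Pieced together, that enrichment step might be sketched as below: a cache loaded from a product table, plus a query that reshapes the change events and joins them to the cache. The type definition, the positional data[]/before[] access, and the property names are assumptions about the event structure, not exact Striim syntax.

```sql
-- Sketch: the shape of the reference data held in memory (types are assumed).
CREATE TYPE ProductLookupType (
  product_id  Integer,
  description String,
  brand       String,
  category    String
);

-- Load product reference data from a database table into an in-memory cache.
CREATE CACHE ProductCache USING DatabaseReader (
  Username: 'striim',
  Password: '********',
  ConnectionURL: 'mysql://dbhost:3306/shop',
  Query: 'SELECT product_id, description, brand, category FROM products'
)
QUERY (keytomap: 'product_id') OF ProductLookupType;

-- Reshape the raw change event and enrich it with product details.
-- data[n]/before[n] positional access is an assumption about the CDC payload.
CREATE CQ EnrichOrders
INSERT INTO EnrichedOrderStream
SELECT o.data[0]   AS orderId,
       o.data[1]   AS productId,
       o.before[3] AS oldQuantity,
       o.data[3]   AS newQuantity,
       p.description,
       p.brand,
       p.category
FROM OrdersChangeStream o
JOIN ProductCache p ON p.product_id = o.data[1];
```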
To show you what that looks like now that it’s running and continuously delivering the data as JSON on Kafka, we’ll create another application that will read that data from Kafka and start to do other things with it. Let’s start off by adding the source that is going to read that data from Kafka in real time. It’s JSON, so we use the JSON parser and add in a data stream. Again, if we save that, deploy it, and run it, you’ll be able to see what that data looks like, and you can see here the data is JSON containing all of the information that we put into it using that query. Now we’re going to write a continuous query to pull just the fields we want from that. This uses the JSON structure, and you can deal with JSON data really easily in our platform: we pull the fields from the JSON and create a data stream. Now if we look at it, you see that the stream looks different. It’s not the original JSON data; we’ve separated it back out into the various fields. If we want to write that data out somewhere, then we can use different targets.
In this case we’re just going to write to a file. This could be a file on disk, or we could use other adapters. We’re going to write it out as delimited data so that we can load it into a spreadsheet and see what the data looks like. We’ll use our delimited formatter, which by default uses a comma delimiter. Save that, and when we run this application it streams that data into a file; all of those changes go through the transformations, and now everything is in a file that we can look at with Excel, for example. So it’s the same data, but now as a single dataset.
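A hedged sketch of this second application is below: read the JSON back off Kafka, flatten it into fields, and write a delimited file. The parser, formatter, and field-access syntax are approximations and would need checking against the Striim documentation.

```sql
-- Sketch: read the enriched JSON records back off the Kafka topic.
CREATE SOURCE EnrichedFromKafka USING KafkaReader (
  brokerAddress: 'kafka1:9092',
  Topic: 'orders_cdc'
)
PARSE USING JSONParser ()
OUTPUT TO EnrichedJSONStream;

-- Pull individual fields out of each JSON document (accessor syntax assumed).
CREATE CQ FlattenEnriched
INSERT INTO FlatOrderStream
SELECT data.get('orderId')     AS orderId,
       data.get('productId')   AS productId,
       data.get('description') AS description,
       data.get('brand')       AS brand,
       data.get('locationId')  AS locationId
FROM EnrichedJSONStream;

-- Write the flattened events to a delimited file that Excel can open;
-- the delimited formatter defaults to a comma separator.
CREATE TARGET OrdersToFile USING FileWriter (
  filename: 'enriched_orders.csv'
)
FORMAT USING DSVFormatter ()
INPUT FROM FlatOrderStream;
```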
You can also write out to other things, and just as easily you can select additional targets, for example writing data into Hadoop too. But we’re not going to write all of the data; we’re just going to write some of it, just one location, where the location ID in the data equals 10, so there’s some filtering going on there. Before we write the data, we’ll add the HDFS target and drop it into the diagram. When we configure this, we need to specify the connection and the various properties that we need, and then the format; we’re going to write this out in Avro. When you write Avro within the platform, you can specify a schema file name, and we’ll write that schema file as well, so that people down the line will know what the data looks like.
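That filtered Hadoop branch could be sketched roughly as follows; the filter mirrors the location check described above, while the HDFS writer and Avro formatter property names are assumptions.

```sql
-- Sketch: only pass through events for a single location.
CREATE CQ FilterLocation10
INSERT INTO Location10Stream
SELECT * FROM FlatOrderStream
WHERE locationId = 10;

-- Write that subset to HDFS as Avro, with a schema file emitted alongside
-- so downstream consumers know the shape of the data.
CREATE TARGET OrdersToHDFS USING HDFSWriter (
  hadoopurl: 'hdfs://namenode:8020/data/orders/',
  filename: 'orders_loc10'
)
FORMAT USING AvroFormatter (
  schemaFileName: 'orders_loc10.avsc'   -- assumed property name
)
INPUT FROM Location10Stream;
```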
Then finally we’re going to do a similar thing, but we’re also going to write the data out to Azure. We’re going to use the Azure Blob Storage target, take the original data stream that had all the data, configure the properties with the information that we get from the Azure Blob Storage definition, and write that data as JSON to Blob Storage. So when we run this application, things will be written to the file, to HDFS, and to Blob Storage all at the same time. You can do more than just simple data movement, though. This is a more complex case where we’re taking data from web logs and doing quite a lot of processing of that data to get it into the form we want, doing some regular expression matching, etc., and writing that data onto Kafka. So this is a use case where we’re distributing weblog data over Kafka.
And you can see that’s what it looks like. Anyone familiar with web logs will see how easy it is to get at weblog data; we even have an Apache access log parser that makes it really easy. Now, this is an application that’s doing analytics on that data. You’ll see there’s a lot more going on, a lot more processing, but it’s taking the data from Kafka and then doing aggregations of it over time: looking at things like how many requests there were within the last 30 seconds, doing some user behavior analytics and other analytics, and storing that data in the built-in results store.
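The aggregation stage of that analytics application might be sketched like this: a 30-second window over the parsed weblog stream and a continuous query producing per-URL summaries. Field names follow a typical Apache access log but are assumptions here.

```sql
-- Sketch: a 30-second window over parsed Apache access log events.
CREATE WINDOW WebLogWindow
OVER WebLogStream KEEP WITHIN 30 SECOND;

-- Continuous per-URL traffic and latency summary, refreshed as events arrive;
-- the output stream could feed the results store and a live dashboard.
CREATE CQ SummarizeTraffic
INSERT INTO TrafficSummaryStream
SELECT requestUrl,
       COUNT(*)          AS hits,
       AVG(responseTime) AS avgResponseMs,
       SUM(bytesSent)    AS totalBytes
FROM WebLogWindow
GROUP BY requestUrl;
```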
Then we built this dashboard over the top of it that shows you what people are searching on, the response time, and a whole bunch of other things, by dragging and dropping visualizations from the side into the dashboard builder and configuring them. And then we have a working dashboard. That’s one example; here are some additional visualizations built with our platform. This one is monitoring security within a factory. This one is monitoring financial transactions in real time and looking for an increase in decline rate. This one is monitoring production quality in a factory and alerting on it. And this one is monitoring passenger traffic through an airport and looking for where you need to move employees to handle additional passenger traffic. So those are just some examples of the types of visualizations that you can build with our platform, based on the streaming integration that is the core of all of the analytics. The streaming platform and streaming integration can support your enterprise goals: by using the built-in stream processing, analytics, visualization, delivery, and wizards-based development, you can get to market really quickly and build these applications without having to build the whole infrastructure first.
You can just download Striim and get up and running really quickly, and start building applications like the ones you’ve seen today, moving your data to the technology that makes sense for the questions you want to ask of it. It is an enterprise-grade solution, designed with end-to-end security, scalability, recoverability, and exactly-once processing guarantees. And it reduces the total cost of ownership, because you can get up to speed really quickly, and we support all of the technologies we utilize in our platform, including Kafka. It’s easy to iterate on and easy to build additional applications once you have streaming data. Striim is also recognized by the industry: we continually get awards for the product, but importantly we’re also a great place to work, and we keep being viewed as a company with a great culture that people enjoy working for.
There are two things I want you to take away from this presentation today. Firstly, now is the right time for streaming integration. It’s an essential part of any data modernization initiative: you can do data distribution, build hybrid cloud, and address all of the other use cases we’ve talked about utilizing streaming integration, and it’s also the foundation for real-time analytics applications, giving you real-time insight into your data. Secondly, Striim is the right solution for streaming integration, because we were designed from the ground up for real-time, continuous, business-critical applications. Our platform gives you the ability to deliver a tangible ROI very quickly with a low total cost of ownership, and it supports all the enterprise sources and targets that you care about. So thank you for your time, and thank you for listening to me tell you about all the great things that you can do with streaming integration and all of the use cases it supports. We’ll now open it up for questions.
Thanks so much Steve. We are past the half-hour mark. For those who can stay on, we’ll answer a few questions, and for those who need to drop off, thank you so much for joining us. Now let’s turn to your questions. Our first question is: can Striim do analytics on stored data in addition to streaming analytics?
So our focus really is on streaming and streaming data, but we do have the built-in results store, backed by Elasticsearch, that you can stream data into really easily. You can also change where you store these results; wherever these data streams go, you can put them into databases and other places as well. And then yes, you can go back in time and visualize stored data, but we’re not a visualization tool for Hadoop or for a data warehouse or things like that. If you wanted to work with that data, you’d have to load it into the platform, into memory, first in order to do any visualization or analytics on it.
Great, excellent. Next question: how does Striim mask or encrypt data?
This is done through functions within our SQL; we have some functions that we have built that allow you to do that masking and encryption. And whenever you’re dealing with data streams, if you’re putting data onto Striim and it’s moving from one place to another within the enterprise, that data is encrypted on the wire, whether it’s using our in-memory data streams or utilizing Kafka. In fact, we had security on Kafka before Kafka had security, because we view security as really important. But if our built-in functionality for masking or encryption isn’t sufficient, it’s very easy to build your own functions and then call them from our SQL.
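As an illustration of that approach, a continuous query can call masking functions on sensitive fields before the data is delivered anywhere. The function names used here, such as maskPhoneNumber and maskEmail, are hypothetical placeholders; check the Striim documentation for the actual built-in functions.

```sql
-- Sketch: anonymize identifiable fields in-flight, before delivery to Kafka
-- or any analytics target. Function names are hypothetical placeholders.
CREATE CQ MaskCustomerData
INSERT INTO MaskedCustomerStream
SELECT customerId,                                   -- surrogate key kept for joins
       maskPhoneNumber(phoneNumber) AS phoneNumber,  -- hypothetical masking function
       maskEmail(email)             AS email,        -- hypothetical masking function
       country,
       planType
FROM CustomerChangeStream;
```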
Thank you. The next question is: can you read from a Postgres database?
Right now we can load reference data from Postgres and we can write into Postgres. We don’t currently have change data capture from Postgres as part of our platform, but it is definitely on our roadmap.
Thanks Steve. Next question: can Striim act as a syslog server to allow applications to directly push their log messages to Striim?
That’s a great question. I know that we have done that; in fact, we have a security specialist on the Striim team who is building out security applications, and he does exactly that. We can probably get back to whoever asked that question with more details on that approach.
Absolutely. This is tangential, but more simply: does Striim have a syslog interface to read data from?
Okay, so can you use a syslog API to read data into Striim? Not currently. We can dig into that one in more detail as well; there are probably ways around it.
Great. Thank you. Can we create visualizations in Striim and publish those dashboards on another website internally in the organization?
Yes, you can. We added the capability to embed dashboards a couple of releases ago, so you can create a dashboard page with all the visualizations you want, click on a drop-down, choose embed, and it’ll create the tags for you. Just drop those tags into any website you want. We deal with all of the security behind the scenes, generating a new user for you, asking you to give it permissions, etc., with a security token that is embedded in those tags. So if you want to revoke that, you can do so at any time. It makes it really easy to embed Striim visualizations anywhere else.
And I think you partially answered the second part of this question: can we also control access to the dashboards using AD groups?
Not currently. We have LDAP integration for users and authentication, but the authorization aspects you have to do through our role-based security model. It’s probably something that you could write a script to automate, but we don’t have group-level authorization integration right now.
Okay, great questions. Does Striim have any parameters that need to be set at the field level for comparison when analyzing the data?
So, to reiterate, our approach to processing and analyzing data is through this SQL-based language. It’s an extension of SQL with a lot of additional capabilities. Obviously, when you’re dealing with streaming data you have to deal with things like time series; we have geospatial capabilities; we have lots of other statistical analysis functions, including things like real-time linear regression; and we have a complex event processing language built into the SQL that allows you to look for sequences of events over time in order to do processing. And if that’s not enough, you can write your own functions in Java and utilize those within our SQL as well, and use any Java methods on any Java objects in the SQL, which enables you to handle unstructured and structured data. So there aren’t really parameters per se; the processing is SQL that you write to do exactly what you want. If you want to parameterize it, you can do that by utilizing cache information or windows or some other means and joining the SQL with that, as sketched below; we have examples of people that put rules into caches and windows so they can be changed dynamically. The SQL approach is designed so that data scientists, business analysts, anyone that knows data, can work with it, and we can give you more information on that if you’re interested.
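For example, one hedged way to parameterize a query, as mentioned above, is to keep thresholds in a cache and join against it, so the rules can change without rewriting the SQL. Every name and property here is illustrative.

```sql
-- Sketch: the shape of a rule record (assumed).
CREATE TYPE RuleType (
  metric_name String,
  max_value   Double
);

-- Load thresholds from a configuration table into an in-memory cache.
CREATE CACHE RuleCache USING DatabaseReader (
  Username: 'striim',
  Password: '********',
  ConnectionURL: 'postgresql://confighost:5432/config',
  Query: 'SELECT metric_name, max_value FROM alert_rules'
)
QUERY (keytomap: 'metric_name') OF RuleType;

-- The threshold comes from the cache rather than being hard-coded in the query,
-- so rules can be maintained in the alert_rules table instead of in the SQL.
CREATE CQ ApplyRules
INSERT INTO ViolationStream
SELECT m.deviceId, m.metricName, m.metricValue, r.max_value
FROM MetricStream m
JOIN RuleCache r ON r.metric_name = m.metricName
WHERE m.metricValue > r.max_value;
```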
Thanks, Steve. Just one last question. I know you spoke quite a bit to this point during the presentation, but can Striim read JSON input files?
Yes. The short answer is yes. The longer answer is definitely yes.
Terrific. Great. Well, it looks like we’ve answered all of our questions. On behalf of Steve Wilkes and the Striim team, I would like to thank you again for joining us for today’s discussion. Have a great rest of your day.