Tutorial

Build Smart, Real-Time Data Pipelines for OpenAI using Striim

Striim transforms data from hundreds of sources into real-time streams for OpenAI

Benefits

Get Started with Streaming

Learn how to play with real-time streams with simple auto-generated data streams

Real-Time Ingest for OpenAI

Enable true real-time ingest using the OpenAI API to build smart AI models

Convert Training Data to JSONL Format

Use Striim’s Continuous Query to process data into the desired format

Overview

The JSON data format is particularly useful for preparing AI training data because it is easy to transfer and manipulate, which makes it simple to summarize relevant information as part of the prompt. For fine-tuning, OpenAI accepts training data as prompt-completion pairs stored in JSON Lines (JSONL) format, where each line is a separate JSON object. Data preparation is a crucial part of creating AI models, and converting JSON to JSONL is the first step. Python is typically used to convert dataset formats, but it may not be the most efficient tool for large datasets and production environments.
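To make the difference between the two formats concrete, here is a minimal Python sketch. The two sample records are invented for illustration and are not taken from the review dataset; only the prompt and completion keys come from the format described above.

```python
import json

# Hypothetical sample records; the values are invented for illustration.
records = [
    {"prompt": "rating=5.0", "completion": "Great coffee, rich and smooth."},
    {"prompt": "rating=2.0", "completion": "Arrived stale and too salty."},
]

# A regular JSON document holds all records in one enclosing array.
as_json = json.dumps(records)

# JSON Lines holds one complete JSON object per line, with no enclosing array.
as_jsonl = "\n".join(json.dumps(record) for record in records)
print(as_jsonl)

# Each line can be parsed on its own, which suits streaming and appending.
parsed = [json.loads(line) for line in as_jsonl.splitlines()]
```

Because every line is independent, a writer can append new records to a JSONL file as events arrive without rewriting what is already there.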

Striim is a unified real-time data streaming and integration product that enables continuous replication from various data sources, including databases, data warehouses, object stores, messaging systems, files, and network protocols. The Continuous Query (CQ) component of Striim uses SQL-like operations to query streaming data with almost no latency.

In this recipe, we read a JSON file of grocery and gourmet food reviews from an S3 bucket and process it using a CQ to generate prompt-completion pairs as input for OpenAI model training. To recreate the Striim application, follow this tutorial. To try Striim for free, sign up for the developer version here. With Striim Developer, you can prototype streaming use cases for production use at no upfront cost, stream up to 10 million events per month with unlimited Streaming SQL queries, and simulate real-time data behavior using Striim’s synthetic continuous data generator.

Background

OpenAI is an artificial intelligence research laboratory that was established with the goal of promoting and developing friendly artificial intelligence. Initially, it operated as a non-profit organization that allowed for free collaboration with institutions and researchers by making its patents and research open to the public. However, as artificial intelligence gained more traction and with investments from major industries like Microsoft, OpenAI transitioned from a non-profit to a for-profit organization, with its profits capped at 100 times any investment.

One of OpenAI’s notable developments is the Generative Pre-trained Transformer-3 (GPT-3), a machine learning-driven language model that generates human-like text using pre-trained algorithms. The latest milestone in OpenAI’s efforts to scale up deep learning is the GPT-4 model, which accepts both image and text inputs and produces text outputs that exhibit close to human-level performance on various professional and academic benchmarks.

Natural Language Generation (NLG) is the domain concerned with converting structured data into meaningful phrases in natural language. GPT-3 has been called “the next generation of NLG” due to its ability to understand data, extract meaning, and identify relationships between data points that can be communicated in plain English.

There are numerous use cases where OpenAI can positively impact businesses. Developers can use the OpenAI API to create applications for chatbots, content creation, customer service, and more. However, an important aspect of using OpenAI is fine-tuning its base models with your own training data. A vast amount of data is generated every day, most of which is unstructured. OpenAI expects its training data in JSONL format, in which each line is a prompt-completion pair. Striim’s CQ component can be used to easily convert real-time data from JSON to JSONL format, making Striim a valuable tool in the pipeline.

Why Striim

Striim offers a straightforward, unified data integration and streaming platform that combines change data capture (CDC), application integration, and Streaming SQL as a fully managed service.

Striim can be used for OpenAI by parsing any type of data from one of Striim’s 100+ streaming sources into the JSONL format, which can be easily uploaded to OpenAI for model creation. The following steps can be taken to use Striim for OpenAI:

Set up a Striim account and connect to the data source from which you want to extract data.
Use Striim’s Continuous Query (CQ) component to query streaming data using SQL-like operations and parse the data into JSONL format.
Save the parsed data into a file and upload it to OpenAI for model creation.

Note that the specific steps involved in using Striim for OpenAI may depend on the particular use case and data source. In this use case, Striim parses the data into JSONL format, which can then be uploaded to OpenAI for model creation.

Core Striim Components

S3 Reader: The S3 Reader source reads from an Amazon S3 bucket. Its output type is WAEvent, except when using the Avro Parser or JSONParser.

Continuous Query: Striim’s continuous queries are continually running SQL queries that act on real-time data and may be used to filter, aggregate, join, enrich, and transform events.

File Writer: Writes files to disk using a compatible formatter.

Step 1: Configure your source containing raw data

You can download the app TQL file (passphrase: striimrecipes) from our GitHub repository, upload it directly into the Flow Designer, and edit the source and target configuration.

For this recipe, we read raw data in JSON format from an S3 bucket. If needed, create an IAM user that can access your S3 bucket. Once your source is set up, go to your homepage. Click create app, followed by ‘Start from scratch’ under ‘Build using Flow Designer’.

Name your app and click save. You will be redirected to the Flow Designer. Select the S3 Reader source from the list of components on the left, enter your S3 bucket name and object name, and choose a relevant parser. For this use case we have a JSON file, so the JSONParser is chosen. You can find the JSON file in our GitHub repository.

Step 2: Write the Continuous Query to convert JSON data into Prompt and Completion

A JSON file can be converted to JSONL using Python, but that is a lengthier process than creating a pipeline with Striim’s CQ component. Drag a CQ component from the list of components on the left and enter the following query:

SELECT
('ReviewerID=' + data.get('reviewerID').textValue() + ", " +
'asin=' + data.get('asin').textValue() + ", " +
'rating=' + data.get('overall'))
as prompt,
data.get('reviewText').textValue()
as completion
FROM groceryStream j

The above query continuously parses the incoming raw data into JSONL format, producing a prompt and a completion for each record.
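For readers who want to see what one output record looks like, the following Python sketch mirrors the query’s logic for a single event. It is an illustration, not part of the Striim app: the field names (reviewerID, asin, overall, reviewText) come from the query above, the sample values are invented, and the way Python renders the numeric rating is an assumption that may differ from how Striim renders it.

```python
import json


def to_prompt_completion(review: dict) -> dict:
    """Mirror the CQ: build the prompt from three fields, use the review text as completion."""
    prompt = (
        "ReviewerID=" + review["reviewerID"] + ", "
        + "asin=" + review["asin"] + ", "
        + "rating=" + str(review["overall"])  # assumption: rating rendered via str()
    )
    return {"prompt": prompt, "completion": review["reviewText"]}


# Hypothetical input event; the values are invented for illustration.
sample_review = {
    "reviewerID": "A1EXAMPLE",
    "asin": "B000EXAMPLE",
    "overall": 5.0,
    "reviewText": "Rich flavor and not too bitter.",
}

# One JSONL line: a single JSON object with a prompt and a completion.
jsonl_line = json.dumps(to_prompt_completion(sample_review))
print(jsonl_line)
```

The difference in the Striim pipeline is that this transformation runs continuously on every event in the stream rather than once over a file.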

Step 3: Read the parsed data and upload to OpenAI using relevant APIs

In this step, we read the JSONL file and upload it to OpenAI for model creation. For this demo, we wrote the parsed data with FileWriter, prepared it with the “fine_tunes.prepare_data” tool, and trained a model on the curie engine using the “fine_tunes.create” API. This entire pipeline can be automated with custom Java functions or Open Processors.

For the FileWriter component, specify the filename, the directory (the path of the output file), the rollover and flush policies, and the formatter.
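Before uploading, it can help to check the file that FileWriter produced. The sketch below is a hypothetical helper, not part of Striim or the OpenAI tooling: it assumes each line should be a JSON object with non-empty string values for prompt and completion, and reports the lines that do not fit.

```python
import json


def validate_jsonl(lines):
    """Return (line_number, problem) tuples; an empty list means no problems were found."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            problems.append((number, "blank line"))
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            problems.append((number, "not valid JSON"))
            continue
        if not isinstance(record, dict):
            problems.append((number, "not a JSON object"))
            continue
        # Both keys must be present and hold non-empty strings.
        for key in ("prompt", "completion"):
            value = record.get(key)
            if not isinstance(value, str) or not value:
                problems.append((number, f"missing or empty '{key}'"))
    return problems


# Hypothetical lines; in practice, pass an open file handle instead.
sample_lines = [
    '{"prompt": "rating=5.0", "completion": "Great coffee."}',
    '{"prompt": "rating=1.0"}',
    "not json",
]
print(validate_jsonl(sample_lines))
```

To check a real file, open it and pass the handle, for example validate_jsonl(open(path, encoding="utf-8")), where path is wherever your FileWriter directory and filename point.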

Step 4: Running the Striim application

Click on Start from the dropdown menu to run your app. You can monitor your data by clicking on the eye wizard next to each stream.

Tuning the Model and Asking Questions

If you do not have an account yet, you can try GPT-3 with three months of free credits. For help with fine-tuning your model, follow this link. After you have installed the openai package locally and exported your account’s API key, you can access OpenAI from your CLI. Use the fine_tunes.prepare_data tool to prepare the training data:

openai tools fine_tunes.prepare_data -f <LOCAL_FILE>

Next, create a fine-tuned model using the fine_tunes.create API:

openai api fine_tunes.create -t <TRAIN_FILE_ID_OR_PATH> -m curie

The fine-tuning job will take some time. Your job may be queued behind another job, and training the model can take minutes or hours depending on the model and dataset size. If the event stream is interrupted for any reason, you can resume it by running:

openai api fine_tunes.follow -i <YOUR_FINE_TUNE_JOB_ID>

After the model is trained, you can start making requests by passing the model name as the model parameter of a completion request using the completions.create API:

openai api completions.create -m <FINE_TUNED_MODEL> -p <YOUR_PROMPT>
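If you want to script these commands rather than type them, one option is to build them as argument lists in Python. This is a minimal sketch under stated assumptions: it uses only the commands and flags shown above, the helper names and the reviews.jsonl file name are hypothetical, and the run helper only works where the same openai CLI is installed with an API key exported.

```python
import subprocess


def prepare_command(local_file: str) -> list:
    """Command that prepares a local training file."""
    return ["openai", "tools", "fine_tunes.prepare_data", "-f", local_file]


def create_command(train_file: str, base_model: str = "curie") -> list:
    """Command that starts a fine-tuning job on the given base model."""
    return ["openai", "api", "fine_tunes.create", "-t", train_file, "-m", base_model]


def follow_command(job_id: str) -> list:
    """Command that resumes following an interrupted event stream."""
    return ["openai", "api", "fine_tunes.follow", "-i", job_id]


def completion_command(model: str, prompt: str) -> list:
    """Command that sends a prompt to a fine-tuned model."""
    return ["openai", "api", "completions.create", "-m", model, "-p", prompt]


def run(command: list) -> None:
    """Run one command and raise if it exits with a non-zero status."""
    subprocess.run(command, check=True)


# Print the commands instead of running them; the file name is hypothetical.
for command in (prepare_command("reviews.jsonl"), create_command("reviews.jsonl")):
    print(" ".join(command))
```

Passing each argument as a separate list item avoids shell quoting problems when a prompt contains spaces or quotation marks.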

OpenAI lets us tune hyperparameters that can improve the precision of the model. In this recipe, we have trained a basic AI model with grocery and gourmet food reviews. The model can be improved with larger datasets and hyperparameter tuning, and businesses can harness the real-time AI models for better decision-making. Here are some of the questions we asked our model:

Question 1: What are customers hating in coffee?

Question 2: What ingredients do I need to make a traditional panang curry?

Question 3: What spices are preferred in roast chicken?

Question 4: What is the most popular food item consumed?

Setting Up the Striim Application

Step 1: Create an S3 user with the required permissions.

Step 2: Configure your S3 Reader source. Enter the access key and secret key for your user.

Step 3: Parse the source data stream to convert it into JSONL format using a Continuous Query.

Step 4: Configure the target to write the parsed data using FileWriter.

Step 5: Deploy and run your real-time streaming application.

Step 6: Use the OpenAI API to prepare the data and fine-tune a model. The resulting AI model responds to questions asked by users.

Wrapping Up: Start your Free Trial Today

Want to try this recipe out for yourself and experience the power of real-time data streaming and integration? Get started on your journey by signing up for Striim Developer or Striim Cloud. Dive into data streaming and analytics with ease and transform your decision-making today. With Striim Developer, you’ll have access to a free sandbox environment that allows you to experiment with Streaming SQL and Change Data Capture for up to 10 million events per month, free forever. It’s an ideal way to dive into the world of data streaming and real-time analytics without any upfront investment.

For those who need a more comprehensive solution, Striim Cloud is the perfect choice. As a fully managed SaaS solution — available on AWS, Google Cloud, and Microsoft Azure — Striim Cloud allows you to focus on building and optimizing your applications while we handle the complex data integration and streaming infrastructure management.