Vector Embeddings

Vector embeddings are dense numeric representations of text, such as sentences, paragraphs, or documents, in a high-dimensional vector space, where the dimensions encode abstract features learned during training. They enable efficient similarity comparisons and power applications such as search, recommendation systems, and advanced data analysis.

For example, an embedding in a 12-dimensional space looks like the following:

[
  0.002243932,
  -0.009333183,
  0.01574578,
  -0.007790351,
  -0.004711035,
  0.014844206,
  -0.009739526,
  -0.03822161,
  -0.0069014765,
  -0.028723348,
  0.02523134,
  0.01814574
]
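As an illustration of the similarity comparisons mentioned above, the following Python sketch computes cosine similarity, one common way to compare two embeddings. The three-dimensional vectors are made-up values for illustration only; real embeddings have hundreds or thousands of dimensions, and this is not code that Striim itself runs.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical 3-dimensional embeddings, for illustration only.
query = [0.1, 0.3, -0.2]
doc_close = [0.2, 0.6, -0.4]  # points in the same direction as query
doc_far = [-0.3, 0.1, 0.0]    # orthogonal to query

print(cosine_similarity(query, doc_close))
print(cosine_similarity(query, doc_far))
```

A score near 1 means the two vectors point in nearly the same direction, and a score near 0 means they are unrelated in this measure.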

Using vector embeddings in your Striim application

Striim enables you to integrate vector embeddings into your streaming applications using the Euclid AI Agent. This agent can be added directly from Flow Designer to generate embeddings from selected data fields and write them to downstream systems. Vector embeddings are commonly stored in databases that support vector types, such as PostgreSQL with the pgvector extension, or in other supported targets, and Striim provides target adapters for integrating with these systems.

The Euclid AI Agent encapsulates the setup required to connect to an AI model (such as OpenAI or Vertex AI) and generate high-dimensional vector representations from your application data. Once configured, you can use built-in SQL functions to transform and pass selected data to the agent for embedding generation.

To use vector embeddings, first configure the Euclid AI Agent in Flow Designer and specify its parameters.

Adding a vector embeddings agent in Flow Designer

You can add the Euclid AI Agent to your application directly from the Flow Designer to generate vector embeddings for real-time data.

Configuring basic parameters for Euclid AI Agent

The following basic parameters for the Euclid (Vector Embeddings) AI Agent are configured in the property panel when the component is selected in the Flow Designer canvas:

Name: specify a unique name for the Euclid component.
Namespace: specify the namespace in which the component operates.
AI Model Provider: select the AI provider. You can select OpenAI or Vertex AI.
Provider-specific settings:
- For OpenAI, provide the following:
  - API key
  - Model name
  - Organization ID
- For Vertex AI, provide the following:
  - Project
  - Model name
  - Service account key
  - Location
  - Publisher

Configuring input for Euclid AI Agent

The input configuration for the Euclid AI Agent is performed in the Input panel after you connect an input stream to the Euclid component in Flow Designer. The following options are available:

Input name: specify the name for the input connection.
Text Field / Alias: select the text field to be processed and provide an alias for use in the embeddings output.
Add Fields: optionally select additional fields from the input stream to include in the embeddings output.
Embedding Generator: select or configure the embedding generator as supported by the chosen AI model provider.
Fields to be Passed On: specify a list of field/alias pairs to be included in the output.
Edit TQL: use the “Edit TQL” link in the Input panel to view and edit the generated TQL for the Euclid component.

Creating a vector embeddings generator using TQL

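The following sample TQL creates an embeddings generator object, a CQ that generates an embedding for each event, and a target that writes the embeddings to PostgreSQL: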

CREATE OR REPLACE EMBEDDINGGENERATOR OpenAIEmbedder2 USING OpenAI (
modelName: 'text-embedding-ada-002',
apiKey: '**',
'organizationID': 'orgID' // optional
);

CREATE CQ EmbGenCQ
INSERT INTO EmbeddingCDCStream
SELECT putUserData(e, 'embedding', java.util.Arrays.toString(generateEmbeddings("admin.OpenAIEmbedder2",
  TO_STRING(GETDATA(e, "description")))))
FROM sourceCDCStream e;

CREATE OR REPLACE TARGET DeliverEmbeddingChangesToPostgresDB USING DatabaseWriter (
  Tables: 'AIDEMO.PRODUCTS,aidemo.products2
    ColumnMap(product_id=product_id,product_name=product_name,description=description,list_price=
    list_price,embedding=@USERDATA(embedding))',
  Username: 'example',
  Password: 'example',
  BatchPolicy: 'EventCount:100,Interval:2',
  CommitPolicy: 'EventCount:100,Interval:2',
  ConnectionURL: 'jdbc:postgresql://url:port/postgres?stringtype=unspecified'
 ) INPUT FROM EmbeddingCDCStream;
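The sample above converts each embedding to a string with java.util.Arrays.toString, which formats an array as a bracketed, comma-separated list. If a downstream consumer needs the numbers back, a parser along the following lines could be used. This is a minimal Python sketch: the function name parse_embedding is hypothetical, and it assumes the stored value keeps exactly that bracketed form.

```python
def parse_embedding(text):
    """Parse a string such as "[0.1, -0.2, 3.0E-4]" into a list of floats."""
    body = text.strip()
    if not (body.startswith("[") and body.endswith("]")):
        raise ValueError("expected a bracketed, comma-separated list")
    inner = body[1:-1].strip()
    if not inner:
        return []  # "[]" is an empty vector
    # float() tolerates surrounding whitespace and exponent notation.
    return [float(part) for part in inner.split(",")]


# Hypothetical value in the bracketed, comma-separated form described above.
sample = "[0.002243932, -0.009333183, 1.0E-4]"
vector = parse_embedding(sample)
print(len(vector), vector[0])
```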

Supported targets for writing vector embeddings

The following are the supported targets for writing vector embeddings and the recommended data types.

Target type	Recommended data type	Target version
PostgreSQL	vector(<dimension>)	PG13+ (see pgvector documentation)
Snowflake	Array
BigQuery	array<float64>
MongoDB Atlas and Cosmos DB for MongoDB vCore	n/a
Spanner	Array
Databricks	Array
Azure SQL database	varchar(max)
Fabric Lakehouse and Fabric Data Warehouse	String
SingleStore (MemSQL)	Vector (SingleStore versions 8.5 and above); Varchar (SingleStore versions below 8.5)	Both 8.5 and above and below 8.5
Oracle	Vector	Oracle 23ai

Embeddings model type options and supported models

The following are the model types available and the connection parameters needed:

Table 1. Embeddings model type options

AI model provider	Embeddings model name (dimensions)	Organization	Required connection parameters
OpenAI	GPT text-embedding-ada-002(1536)	OpenAI	API_Key, OrganizationID (optional)
VertexAI	textembedding-gecko(768)	Google	ProjectId, Location, Provider, ServiceAccountKey
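The required connection parameters in Table 1 can be checked before a generator is created. The following Python sketch is illustrative only: the dictionary copies the parameter names from Table 1, while the function missing_params and the sample values are hypothetical and not part of Striim.

```python
# Required connection parameters per provider, copied from Table 1.
# OrganizationID is optional for OpenAI, so it is not listed here.
REQUIRED_PARAMS = {
    "OpenAI": ["API_Key"],
    "VertexAI": ["ProjectId", "Location", "Provider", "ServiceAccountKey"],
}


def missing_params(provider, supplied):
    """Return the required parameter names that are absent or empty."""
    if provider not in REQUIRED_PARAMS:
        raise ValueError(f"unknown provider: {provider}")
    return [name for name in REQUIRED_PARAMS[provider] if not supplied.get(name)]


# Hypothetical settings, for illustration only.
settings = {"ProjectId": "my-project", "Location": "us-central1"}
print(missing_params("VertexAI", settings))
```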

Table 2. Supported models

Model name	Model provider	Token limit	Dimensions
text-embedding-ada-002	OpenAI	8192	1536
text-embedding-3-small	OpenAI	8192	1536
text-embedding-3-large	OpenAI	8192	3072
textembedding-gecko@001	VertexAI	3092	768
textembedding-gecko-multilingual@001	VertexAI	3092	768
textembedding-gecko@002	VertexAI	3092	768
textembedding-gecko@003	VertexAI	3092	768
textembedding-gecko@latest	VertexAI	3092	768
textembedding-gecko-multilingual@latest	VertexAI	3092	768
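The Dimensions column matters when a target column has a fixed size, such as the PostgreSQL vector(<dimension>) type listed among the supported targets. The following Python sketch derives the column type from a model name and checks a vector's length. The dimension values are copied from Table 2 (only a few models are shown); the helper names are hypothetical.

```python
# Dimensions per model, copied from Table 2 (subset shown for brevity).
MODEL_DIMENSIONS = {
    "text-embedding-ada-002": 1536,
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
    "textembedding-gecko@003": 768,
}


def pgvector_column_type(model_name):
    """Return the vector(<dimension>) type string for the given model."""
    return f"vector({MODEL_DIMENSIONS[model_name]})"


def has_expected_dimension(model_name, embedding):
    """Check that an embedding's length matches the model's dimensions."""
    return len(embedding) == MODEL_DIMENSIONS[model_name]


print(pgvector_column_type("text-embedding-3-large"))
print(has_expected_dimension("textembedding-gecko@003", [0.0] * 768))
```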

Table 3. Default models

Model provider	Model name
OpenAI	text-embedding-3-small
VertexAI	textembedding-gecko@003

Using batching when generating vector embeddings

Batch processing can improve performance for use cases where you want to generate vector embeddings from streaming data. Batching aggregates the events and data to be embedded, generates the embeddings using the generateEmbeddingsPerBatch function, and flattens the resulting nested data structure into a single list of embeddings.

When you use the Euclid AI Agent component in Flow Designer, batching is supported if the selected fields (columns) are of type List<String>. Striim automatically generates a continuous query (CQ) that uses the generateEmbeddingsPerBatch function to perform batch processing.

In the Euclid AI Agent configuration panel, you can select one or more fields to embed. When batching is enabled (by selecting a list-type field), the Euclid component configures an internal window, a batch CQ, and an embedding CQ to optimize the embedding process.

Note

Striim currently supports batching only for vector embedding models configured with OpenAI.

The following TQL template shows the window, the batch CQ, the embedding CQ, and the flattening CQ:

// Window to batch incoming events.
CREATE JUMPING WINDOW <window_name> OVER <source_stream>
KEEP <window_policy>;

// Aggregate the events and the data to be embedded.
CREATE OR REPLACE CQ <batch_cq_name>
INSERT INTO <batch_stream>
SELECT list(w) as events, list(TO_STRING(<col_name>)) as data FROM <window_name> w;

// Generate embeddings per batch and pair each event with its embedding.
CREATE CQ <embedding_cq_name>
INSERT INTO <embedding_stream>
SELECT makeTupleList(a.events, generateEmbeddingsPerBatch("<namespace>.<object_name>", a.data)) as objs FROM <batch_stream> a;

// Flatten the list into one output event per embedding.
CREATE CQ <flatten_cq_name>
INSERT INTO <out_stream>
SELECT putUserData(cast(origevent.get(0) as com.webaction.proc.events.WAEvent),
  'embedding',
  java.util.Arrays.toString(cast(origevent.get(1) as java.lang.Float[])))
FROM <embedding_stream> e, ITERATOR (e.objs, java.util.List) as origevent;
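The batching flow in the template (window, aggregate, embed per batch, flatten) can be pictured with the following Python sketch. It is an analogy rather than Striim code: it assumes a count-based window policy, and fake_embed_batch is a hypothetical stand-in for a real provider call, so the vectors it returns are meaningless.

```python
def jumping_batches(events, batch_size):
    """Yield consecutive, non-overlapping batches of events."""
    for start in range(0, len(events), batch_size):
        yield events[start:start + batch_size]


def fake_embed_batch(texts):
    """Hypothetical stand-in for a provider call: one small vector per text."""
    return [[float(len(t)), float(t.count(" "))] for t in texts]


def embed_in_batches(events, text_key, batch_size, embed_batch=fake_embed_batch):
    """Embed events batch by batch and return one output record per event."""
    results = []
    for batch in jumping_batches(events, batch_size):
        # Aggregate the text to embed, similar to list(TO_STRING(<col_name>)).
        texts = [str(event[text_key]) for event in batch]
        embeddings = embed_batch(texts)  # one call per batch
        if len(embeddings) != len(batch):
            raise ValueError("embedder returned a different number of vectors")
        # Pair each event with its embedding, then flatten into one list.
        for event, embedding in zip(batch, embeddings):
            results.append({**event, "embedding": embedding})
    return results


# Hypothetical events, for illustration only.
events = [{"id": i, "description": f"sample item {i}"} for i in range(5)]
for record in embed_in_batches(events, "description", batch_size=2):
    print(record["id"], record["embedding"])
```

With five events and a batch size of two, the embedder is called three times instead of five, which is where the performance benefit of batching comes from.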