Vector Embeddings

Vector embeddings are dense numerical representations of text (such as sentences, paragraphs, or documents) in a high-dimensional vector space, where the dimensions encode abstract features learned during training. Because semantically similar inputs map to nearby vectors, embeddings enable efficient similarity comparisons and power applications like search, recommendation systems, and advanced data analysis.

For example, an embedding in a 12-dimensional space looks like the following:

[
  0.002243932,
  -0.009333183,
  0.01574578,
  -0.007790351,
  -0.004711035,
  0.014844206,
  -0.009739526,
  -0.03822161,
  -0.0069014765,
  -0.028723348,
  0.02523134,
  0.01814574
]
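
To make "similarity comparison" concrete, the following is a minimal Python sketch of cosine similarity, one common way to compare two embeddings. It is illustrative only and is not part of Striim; the `scaled` and `opposite` vectors are hypothetical values derived from the sample embedding above.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# The 12-dimensional sample embedding shown above.
sample = [
    0.002243932, -0.009333183, 0.01574578, -0.007790351,
    -0.004711035, 0.014844206, -0.009739526, -0.03822161,
    -0.0069014765, -0.028723348, 0.02523134, 0.01814574,
]

# Hypothetical comparison vectors: same direction, and opposite direction.
scaled = [2 * x for x in sample]
opposite = [-x for x in sample]

same_direction_score = cosine_similarity(sample, scaled)
opposite_direction_score = cosine_similarity(sample, opposite)
```

Scores close to 1 indicate vectors that point in nearly the same direction, and scores close to -1 indicate opposite directions. Vector databases typically provide this kind of comparison natively, so a sketch like this is mainly useful for spot checks.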

Using vector embeddings in your Striim application

Striim enables you to integrate vector embeddings into your streaming applications using the Euclid AI Agent. This agent can be added directly from Flow Designer to generate embeddings from selected data fields and write them to downstream systems. Vector embeddings are commonly stored in databases that support vector data, such as PostgreSQL with the pgvector extension, and Striim provides target adapters for integrating with these and other supported targets.

The Euclid AI Agent encapsulates the setup required to connect to an AI model (such as OpenAI or Vertex AI) and generate high-dimensional vector representations from your application data. Once configured, you can use built-in SQL functions to transform and pass selected data to the agent for embedding generation.

To use vector embeddings, first configure the Euclid AI Agent in Flow Designer and specify its parameters, as described in the following sections. You can also apply SQL-based transformations to prepare your input data before it is embedded.

Adding a vector embeddings agent in Flow Designer

You can add the Euclid AI Agent to your application directly from the Flow Designer to generate vector embeddings for real-time data.

Configuring basic parameters for Euclid AI Agent

The following basic parameters for the Euclid (Vector Embeddings) AI Agent are configured in the property panel when the component is selected in the Flow Designer canvas:

  • Name: specify a unique name for the Euclid component.

  • Namespace: specify the namespace in which the component operates.

  • AI Model Provider: select the AI provider. You can select OpenAI or Vertex AI.

  • Provider-specific settings:

    • For OpenAI, provide the following:

      • API key

      • Model name

      • Organization ID

    • For Vertex AI, provide the following:

      • Project

      • Model name

      • Service account key

      • Location

      • Publisher

Configuring input for Euclid AI Agent

The input configuration for the Euclid AI Agent is performed in the Input panel after you connect an input stream to the Euclid component in Flow Designer. The following options are available:

  • Input name: specify the name for the input connection.

  • Text Field / Alias: select the text field to be processed and provide an alias for use in the embeddings output.

  • Add Fields: optionally select additional fields from the input stream to include in the embeddings output.

  • Embedding Generator: select or configure the embedding generator as supported by the chosen AI model provider.

  • Fields to be Passed On: specify a list of field/alias pairs to be included in the output.

  • Edit TQL: use the “Edit TQL” link in the Input panel to view and edit the generated TQL for the Euclid component.

    [Image: ai-agents-euclid-vector-embed.png]

Creating a vector embeddings generator using TQL

The following sample TQL creates an embeddings generator object, a CQ that calls it to attach an embedding to each event, and a target that writes the result to PostgreSQL:

CREATE OR REPLACE EMBEDDINGGENERATOR OpenAIEmbedder2 USING OpenAI (
  modelName: 'text-embedding-ada-002',
  apiKey: '**',
  organizationID: 'orgID' // optional
);

// Generate an embedding from the description field and store it in the event's user data.
CREATE CQ EmbGenCQ
INSERT INTO EmbeddingCDCStream
SELECT putUserData(e, 'embedding', java.util.Arrays.toString(generateEmbeddings("admin.OpenAIEmbedder2",
  TO_STRING(GETDATA(e, "description")))))
FROM sourceCDCStream e;

// Write the events to PostgreSQL, mapping the stored embedding to the embedding column.
CREATE OR REPLACE TARGET DeliverEmbeddingChangesToPostgresDB USING DatabaseWriter (
  Tables: 'AIDEMO.PRODUCTS,aidemo.products2
    ColumnMap(product_id=product_id,product_name=product_name,description=description,list_price=
    list_price,embedding=@USERDATA(embedding))',
  Username: 'example',
  Password: 'example',
  BatchPolicy: 'EventCount:100,Interval:2',
  CommitPolicy: 'EventCount:100,Interval:2',
  ConnectionURL: 'jdbc:postgresql://url:port/postgres?stringtype=unspecified'
) INPUT FROM EmbeddingCDCStream;
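
In the sample above, the embedding is stored as the text produced by java.util.Arrays.toString, which has the form "[0.1, -0.2, 0.3]". If a downstream consumer needs numeric values again, a small parser is enough. The following Python sketch is an assumption about how such a consumer might look, not part of Striim; the function name `parse_embedding` and its `expected_dimensions` parameter are hypothetical.

```python
def parse_embedding(text, expected_dimensions=None):
    """Parse a string such as "[0.1, -0.2, 0.3]" into a list of floats.

    Assumes the bracketed, comma-separated layout written by
    java.util.Arrays.toString; null entries are not handled.
    """
    stripped = text.strip()
    if not (stripped.startswith("[") and stripped.endswith("]")):
        raise ValueError("expected a bracketed, comma-separated list")
    body = stripped[1:-1].strip()
    # float() tolerates surrounding whitespace and exponent notation (1.0E-4).
    values = [float(part) for part in body.split(",")] if body else []
    if expected_dimensions is not None and len(values) != expected_dimensions:
        raise ValueError(
            f"expected {expected_dimensions} dimensions, got {len(values)}"
        )
    return values


# Hypothetical three-dimensional value, shortened for readability.
parsed = parse_embedding("[0.1, -0.2, 1.0E-4]", expected_dimensions=3)
```

Checking the dimension count at read time catches a mismatch between the model that produced the embedding and the column or index that stores it.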

Supported targets for writing vector embeddings

The following are the supported targets for writing vector embeddings and the recommended data types.

| Target type | Recommended data type | Target version |
|---|---|---|
| PostgreSQL | vector(<dimension>) | PG13+ (see pgvector documentation) |
| Snowflake | Array | |
| BigQuery | array<float64> | |
| MongoDB Atlas and Cosmos DB for MongoDB vCore | n/a | |
| Spanner | Array | |
| Databricks | Array | |
| Azure SQL database | varchar(max) | |
| Fabric Lakehouse and Fabric Data Warehouse | String | |
| SingleStore (MemSQL) | Vector for SingleStore versions 8.5 and above; Varchar for SingleStore versions below 8.5 | Versions 8.5 and above, and versions below 8.5 |
| Oracle | Vector | Oracle 23ai |

Embeddings model type options and supported models

The following are the model types available and the connection parameters needed:

Table 1. Embeddings model type options

| AI model provider | Embeddings model name (dimensions) | Organization | Required connection parameters |
|---|---|---|---|
| OpenAI | GPT text-embedding-ada-002 (1536) | OpenAI | API_Key, OrganizationID (optional) |
| VertexAI | textembedding-gecko (768) | Google | ProjectId, Location, Provider, ServiceAccountKey |



Table 2. Supported models

| Model name | Model provider | Token limit | Dimensions |
|---|---|---|---|
| text-embedding-ada-002 | OpenAI | 8192 | 1536 |
| text-embedding-3-small | OpenAI | 8192 | 1536 |
| text-embedding-3-large | OpenAI | 8192 | 3072 |
| textembedding-gecko@001 | VertexAI | 3092 | 768 |
| textembedding-gecko-multilingual@001 | VertexAI | 3092 | 768 |
| textembedding-gecko@002 | VertexAI | 3092 | 768 |
| textembedding-gecko@003 | VertexAI | 3092 | 768 |
| textembedding-gecko@latest | VertexAI | 3092 | 768 |
| textembedding-gecko-multilingual@latest | VertexAI | 3092 | 768 |



Table 3. Default models

| Model provider | Model name |
|---|---|
| OpenAI | text-embedding-3-small |
| VertexAI | textembedding-gecko@003 |
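
The dimension of the target column has to match the model that generates the embeddings. The following Python sketch encodes the supported and default models listed above and derives the matching PostgreSQL column type. It is a planning aid under the assumption that the tables above are current; the names `SUPPORTED_MODELS`, `DEFAULT_MODELS`, and `postgres_vector_type` are hypothetical and are not Striim APIs.

```python
# Model name -> (provider, dimensions), copied from the supported models table.
SUPPORTED_MODELS = {
    "text-embedding-ada-002": ("OpenAI", 1536),
    "text-embedding-3-small": ("OpenAI", 1536),
    "text-embedding-3-large": ("OpenAI", 3072),
    "textembedding-gecko@001": ("VertexAI", 768),
    "textembedding-gecko-multilingual@001": ("VertexAI", 768),
    "textembedding-gecko@002": ("VertexAI", 768),
    "textembedding-gecko@003": ("VertexAI", 768),
    "textembedding-gecko@latest": ("VertexAI", 768),
    "textembedding-gecko-multilingual@latest": ("VertexAI", 768),
}

# Provider -> default model, copied from the default models table.
DEFAULT_MODELS = {
    "OpenAI": "text-embedding-3-small",
    "VertexAI": "textembedding-gecko@003",
}


def postgres_vector_type(model_name):
    """Return the pgvector column type sized for the given model."""
    if model_name not in SUPPORTED_MODELS:
        raise KeyError(f"unknown embeddings model: {model_name}")
    _provider, dimensions = SUPPORTED_MODELS[model_name]
    return f"vector({dimensions})"


openai_default_type = postgres_vector_type(DEFAULT_MODELS["OpenAI"])
vertex_default_type = postgres_vector_type(DEFAULT_MODELS["VertexAI"])
```

Switching models later (for example, from text-embedding-3-small to text-embedding-3-large) changes the dimension count, so an existing vector column sized for the old model would need to be recreated.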



Using batching when generating vector embeddings

Batch processing can improve performance for use cases where you want to generate vector embeddings from streaming data. Batching aggregates the events and data to be embedded, generates the embeddings using the generateEmbeddingsPerBatch function, and flattens the resulting nested data structure into a single list of embeddings.

When using the Euclid AI Agent component in Flow Designer, batching is supported if the selected fields (columns) are of type List<String>. The system automatically generates a continuous query (CQ) that uses the generateEmbeddingsPerBatch function to perform batch processing.

In the Euclid AI Agent configuration panel, you can select one or more fields to embed. When batching is enabled (by selecting a list-type field), the Euclid component configures an internal window, a batch CQ, and an embedding CQ to optimize the embedding process.

Note

Striim currently supports batching only for vector embedding models configured with OpenAI.

// Window that groups incoming events into batches.
CREATE JUMPING WINDOW <window_name> OVER <source_stream>
KEEP <window_policy>;

// Aggregate the events and the data to be embedded.
CREATE OR REPLACE CQ <batch_CQ_name>
INSERT INTO <batch_stream>
SELECT list(w) as events, list(TO_STRING(<col_name>)) as data FROM <window_name> w;

// Generate embeddings per batch and pair each event with its embedding.
CREATE CQ <embedding_CQ_name>
INSERT INTO <embedding_stream>
SELECT makeTupleList(a.events, generateEmbeddingsPerBatch(<namespace>.<object_name>, a.data)) as objs FROM <batch_stream> a;

// Flatten the list into one output event per embedding.
CREATE CQ <flatten_CQ_name>
INSERT INTO <out_stream>
SELECT putUserData(cast(origevent.get(0) as com.webaction.proc.events.WAEvent),
  'embedding',
  java.util.Arrays.toString(cast(origevent.get(1) as java.lang.Float[])))
FROM <embedding_stream> e, ITERATOR (e.objs, java.util.List) as origevent;
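
The window, aggregate, embed, and flatten stages above can be easier to follow outside TQL. The following Python sketch mimics the same data flow with in-memory lists. It is a conceptual model only: `fake_embed_batch` is a hypothetical stand-in for a real batch embedding call, and none of these function names exist in Striim.

```python
def jumping_windows(events, size):
    """Yield consecutive, non-overlapping batches of up to `size` events."""
    for start in range(0, len(events), size):
        yield events[start:start + size]


def fake_embed_batch(texts):
    """Hypothetical embedder: returns one small vector per input text."""
    return [[float(len(t)), float(sum(map(ord, t)) % 97)] for t in texts]


def embed_in_batches(events, text_field, batch_size, embed_batch=fake_embed_batch):
    """Embed `text_field` of every event, making one embedder call per batch."""
    results = []
    for window in jumping_windows(events, batch_size):
        # Aggregate: collect the text to embed for the whole batch.
        texts = [str(event[text_field]) for event in window]
        # Embed: a single call covers every event in the batch.
        vectors = embed_batch(texts)
        if len(vectors) != len(window):
            raise ValueError("embedder must return one vector per input text")
        # Pair and flatten: emit one output record per original event.
        for event, vector in zip(window, vectors):
            results.append({**event, "embedding": vector})
    return results


# Hypothetical input events; descriptions have lengths 1 through 5.
sample_events = [{"id": i, "description": "x" * i} for i in range(1, 6)]
embedded_events = embed_in_batches(sample_events, "description", batch_size=2)
```

With five events and a batch size of 2, the embedder is called three times (for batches of 2, 2, and 1 events) instead of five, which is where the performance benefit of batching comes from. The output keeps one record per input event, in the original order.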