Engineering

Vector & Semantic Search in the Lakehouse: Faster Insight from Unstructured Data

Keyword search matches only the exact term, while similarity (vector) search matches semantic intent.

Vector Search, Similarity Search, and Semantic Search in the Lakehouse


By 2025, the world will generate 175 zettabytes of data, and ≈80% of it will be unstructured; yet 90% of that trove is never analyzed (Forbes Tech Council, 2022). Conventional SQL predicates (LIKE '%price%') were designed for rows and columns, not for emojis, paraphrases, or multilingual nuance. Vector search, semantic search, and similarity search offer the first scalable way, via high-dimensional vector embeddings, to make these dark corners of the lakehouse searchable without exporting petabytes to a specialist silo.

From Neat Tables to Messy Reality

Early “data warehouse” stories read like urban planning fables: define schema, enforce contracts, rinse, repeat. Then reality intruded. Modern ingest tiers happily dump semi-structured or fully opaque blobs into S3, Delta, and Iceberg: Pick your poison.

A reminder that unstructured data is outgrowing structured data

Text fields become data swamps. A column labeled ‘Customer Notes’ says nothing about which notes it actually contains: no semantics are attached, just blobs of text. Dashboards built on top simply ignore the sentiment, intent, and sarcasm hidden inside.

Text fields live in SQL tables, but they are hard to analyze because SQL has no built-in semantic understanding. Traditional predicates (LIKE '%price%') miss paraphrases, negations, emojis, and multilingual synonyms. The result is an analytics blind spot that grows in lockstep with your customer’s voice.

Why Keyword Search Isn’t Enough

Human language relies on meaning, not token coincidence. The image below compares Keyword Search and Similarity Search: the former highlights identical strings, while the latter clusters semantically proximate phrases, even when their stems diverge.

Keyword match = token coincidence; human understanding = semantic proximity.

semantic proximity ∝ cos θ (cosine distance = 1 − cos θ)

Keyword search matches only the exact term, while similarity (vector) search matches semantic intent

Type ‘enterprise plan’ into a keyword search and the engine looks for those literal characters. But an enterprise plan can mean a premium tier, guaranteed uptime, or white-glove support, none of which a textual lookup will surface. Semantic search, in contrast, lands you on results that describe exactly that: the premium tier, the uptime guarantee, the white-glove support. It searches by meaning, not just words.

Example: Think of a support scenario. An agent tries to find “refund frustration” tickets, but the keyword engine surfaces only literal matches. In contrast, a vector engine bubbles up “product didn’t meet expectations,” “want my money back,” and a dozen more paraphrases that the agent would otherwise miss.

Vector embeddings map every phrase into a high-dimensional space; cosine similarity then retrieves nearest neighbors regardless of wording, each vector being a numeric representation of linguistic context.
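To make the geometry concrete, here is a minimal pure-Python sketch of cosine similarity over hand-made toy vectors. The 3-dimensional embeddings below are invented for illustration; real models emit hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """cos(theta) = (a . b) / (|a| * |b|) -- compares direction, not magnitude."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy embeddings (real embeddings are far larger).
phrases = {
    "too expensive":       [0.90, 0.10, 0.20],
    "price is a bit much": [0.85, 0.15, 0.25],
    "great battery life":  [0.10, 0.90, 0.30],
}

query = phrases["too expensive"]
ranked = sorted(phrases, key=lambda p: cosine_similarity(query, phrases[p]),
                reverse=True)
print(ranked[1])  # → price is a bit much
```

Nearest-neighbour retrieval is just this ranking at scale: embed the query, rank stored vectors by cosine similarity, return the top results.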

The Lakehouse Dilemma

Analytical databases excel at predicate push-down, column pruning, vectorized execution—on structured columns. Vector databases bloom on the side, optimized for ANN indices but alien to OLAP joins.

Operational DBs hold a slice, ETL copies into warehouses, specialized vector DBs fork yet another silo. Each hop costs latency, dollars, and governance headaches.

Enter e6data’s “Unify, Don’t Migrate” Motto

e6data’s philosophy—“Unify, Don’t Migrate”—embeds vector functions into the same optimizer that already handles column pruning and distributed scans. One planner, one security layer, zero copies.

Same table, same optimizer, same file reads—only now the planner injects a vector projection. The promise is seductive: no pipelines, zero duplication, federated permissions. The architectural “why”: pushing vector search into the query engine keeps compute close to the data, sharing the cache, scheduler, and security layers already battle-tested for petabyte SQL scans. It turns a maintenance nightmare into a compiler problem.

Semantic Search Up Close

Image credits: Medium

Semantic search works by transforming the unstructured data (images, documents, audio, video) into high-dimensional vector embeddings, i.e., numeric representations. But the way this works depends on the source data.

Take a look at this image:

Translate “too expensive” into [0.815, 0.642, -0.233, …] and suddenly “price is a bit much,” “can’t afford this,” and “costs more than expected” land in the same neighborhood even though they share zero lexical tokens.

Vectors turn free text into geometry; cosine similarity measures angular distance regardless of magnitude.  
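A quick check of the magnitude claim, using a truncated 3-dimensional stand-in for the example vector above: scaling a vector changes its length but not its direction, so cosine similarity is unaffected.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

v = [0.815, 0.642, -0.233]   # truncated stand-in for the embedding above
scaled = [3 * x for x in v]  # same direction, three times the magnitude

print(round(cosine_similarity(v, scaled), 6))  # → 1.0
```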

A Tiny Experiment

Open your favorite SQL prompt and try the most innocent query against an Amazon reviews export:

SELECT review_id, review_headline
FROM   reviews
WHERE  review_headline ILIKE '%too expensive%'
LIMIT  10;

It returns a handful of hits but misses the tens of thousands of paraphrases hiding behind different wording.

With Vector Search, the above query can be rewritten as:

SELECT review_id, review_headline
FROM   reviews
WHERE  cosine_distance(review_headline, 'too expensive') < 0.1
LIMIT  10;

This represents a nearest-neighbour search on the review_headline column.

This query can now be extended with further SQL constructs, combining the best of SQL and vector operations.
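In Python terms, with pre-computed toy embeddings standing in for a real embedding step (all values below are invented for illustration), the rewritten query amounts to a threshold filter on cosine distance plus a LIMIT-style cut:

```python
import math

def cosine_distance(a, b):
    """1 - cos(theta); 0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

# (review_id, review_headline, headline_embedding) -- hypothetical rows.
reviews = [
    (1, "too expensive",            [0.90, 0.10, 0.20]),
    (2, "costs more than expected", [0.88, 0.12, 0.22]),
    (3, "great battery life",       [0.10, 0.90, 0.30]),
]
query_vec = [0.90, 0.10, 0.20]  # embedding of 'too expensive', stubbed

# WHERE cosine_distance(...) < 0.1 LIMIT 10
hits = [(rid, text) for rid, text, vec in reviews
        if cosine_distance(vec, query_vec) < 0.1][:10]
print(hits)  # → [(1, 'too expensive'), (2, 'costs more than expected')]
```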

How Vector, Semantic, & Similarity Search Work

We’ve seen why keyword search buckles under the weight of customer slang, emojis, and multilingual nuances, and we’ve explored how silo-heavy architectures force teams into cost-intensive copy-and-paste gymnastics.

The takeaway is clear: to mine real insight from the 80% of data that hides in free-form text, we need an engine that speaks both SQL and semantics without spawning yet another datastore. That’s where vector, semantic, and similarity search step in. But how do they actually convert messy language into precise geometry that the optimizer can race through?

  • Embedding – Models such as SBERT or OpenAI’s text-embedding-3-large convert each document into a high-dimensional vector (SBERT models typically emit 384 or 768 dimensions; text-embedding-3-large emits 3,072).
  • ANN Index – Several index types are available depending on your data set and vector DB; structures like HNSW, ScaNN, and DiskANN accelerate Approximate Nearest Neighbor lookups.
  • Query flow in e6data’s bottom-up SQL engine architecture:
    • Run the SQL filter
    • Reduce the search space
    • Carry out the vector search on the reduced search space
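The three-step flow above can be sketched as follows: apply the cheap SQL predicate first, then run the costlier vector comparison only over the surviving rows. The rows, ratings, and embeddings below are all invented for illustration.

```python
import math

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

# (review_id, star_rating, headline_embedding) -- hypothetical rows.
rows = [
    (1, 1, [0.90, 0.10, 0.20]),   # "too expensive"
    (2, 2, [0.88, 0.12, 0.22]),   # "costs more than expected"
    (3, 5, [0.10, 0.90, 0.30]),   # "great battery life"
    (4, 1, [0.12, 0.85, 0.35]),   # "stopped working"
]
query_vec = [0.90, 0.10, 0.20]    # embedding of the query phrase, stubbed

# 1) Run the SQL filter (here: star_rating <= 2) ...
candidates = [r for r in rows if r[1] <= 2]
# 2) ... which reduces the search space from 4 rows to 3 ...
# 3) ... then vector-search only the reduced set.
hits = [rid for rid, _, vec in candidates
        if cosine_distance(vec, query_vec) < 0.1]
print(hits)  # → [1, 2]
```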

Conclusion

Unstructured data isn’t a side gig anymore; it’s the majority of your lakehouse. Vector search, semantic search, and similarity search turn those blobs into queryable geometry without abandoning SQL or spinning up another silo. In our next blog post, we’ll dive into embedding best practices, cosine similarity math, and writing first-class SQL over vectors.


June 11, 2025
Adishesh Kishore
Srikanth Venugopalan
