Engineering

Vector & Semantic Search in the Lakehouse: Faster Insight from Unstructured Data

June 11, 2025 / Adishesh Kishore, Srikanth Venugopalan

Vector Search, Similarity Search, and Semantic Search in the Lakehouse


By 2025, the world will generate 175 zettabytes of data, roughly 80% of it unstructured; yet 90% of that trove is never analyzed (Forbes Tech Council, 2022). Conventional SQL predicates (LIKE '%price%') were designed for rows and columns, not emojis, paraphrases, or multilingual nuance. Vector search, semantic search, and similarity search provide the first scalable way, via high-dimensional vector embeddings, to make these dark corners of the lakehouse searchable without exporting petabytes to a specialist silo.

From Neat Tables to Messy Reality

Early “data warehouse” stories read like urban planning fables: define the schema, enforce contracts, rinse, repeat. Then reality intruded. Modern ingest tiers happily dump semi-structured or fully opaque blobs into S3, Delta, and Iceberg; pick your poison.

A reminder that unstructured data is outgrowing structured data

Text fields become data swamps. A field labeled ‘Customer Notes’, for example, says nothing about what those notes contain: there are no semantics attached to it, just blobs of text. Dashboards built on top simply ignore the sentiment, intent, and sarcasm hidden inside.

Text fields are easy to store in SQL but hard to analyze, because the engine has no built-in semantic understanding. Traditional predicates (LIKE '%price%') miss paraphrases, negations, emojis, and multilingual synonyms. The result is an analytics blind spot that grows in lockstep with your customers’ voices.

Why Keyword Search Isn’t Enough

Human language relies on meaning, not token coincidence. The image below compares Keyword Search and Similarity Search: the former highlights identical strings, while the latter clusters semantically proximate phrases, even when their stems diverge.

Keyword match = token coincidence; human understanding = semantic proximity.

semantic proximity ∝ 1 − cos θ

Keyword search matches only the exact term, while similarity (vector) search matches semantic intent

Type ‘enterprise plan’ into a keyword search and it looks for exactly those characters. But an enterprise plan could mean a premium tier, guaranteed uptime, or white-glove support, none of which a literal textual lookup will surface. Semantic search, in contrast, retrieves by meaning: look up ‘enterprise plan’ and you land on results covering the premium tier, guaranteed uptime, and white-glove support, not just the literal words.

Example: Think of a support scenario. An agent tries to find “refund frustration” tickets, but the keyword engine surfaces only literal matches. In contrast, a vector engine bubbles up “product didn’t meet expectations,” “want my money back,” and a dozen more paraphrases that the agent would otherwise miss.

Query | Keyword Search Finds | Similarity Search Also Surfaces
“enterprise plan” | Strings containing “enterprise plan” | “premium tier”, “guaranteed uptime”, “white-glove support”
“refund frustration” | Literal phrase only | “want my money back”, “didn’t meet expectations”


Vector embeddings map every phrase into a high-dimensional space; cosine similarity then retrieves nearest neighbors regardless of wording, each vector being a numeric representation of linguistic context.
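
To make “angular distance” concrete: for two vectors a and b, cosine similarity is a · b / (‖a‖ ‖b‖). Here is a minimal Python sketch with toy three-dimensional vectors standing in for real embeddings:

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine of the angle between two vectors; 1.0 means "same direction".
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dimensional "embeddings"; real models emit hundreds of dimensions.
enterprise_plan = np.array([0.9, 0.1, 0.3])
premium_tier    = np.array([0.8, 0.2, 0.35])   # semantically close phrase
toaster_review  = np.array([-0.2, 0.9, -0.5])  # unrelated phrase

print(cosine_similarity(enterprise_plan, premium_tier))   # ~0.99: near neighbors
print(cosine_similarity(enterprise_plan, toaster_review)) # negative: far apart

The absolute numbers are not meaningful on their own; what matters is the ordering, which is exactly what nearest-neighbor retrieval relies on.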

The Lakehouse Dilemma

Analytical databases excel at predicate push-down, column pruning, vectorized execution—on structured columns. Vector databases bloom on the side, optimized for ANN indices but alien to OLAP joins.

Operational DBs hold a slice, ETL copies into warehouses, specialized vector DBs fork yet another silo. Each hop costs latency, dollars, and governance headaches.

Option | Strength | Pain
Standalone vector DB | Millisecond ANN search | New cluster, new governance, ETL hop
Traditional OLAP | Joins, predicate push-down | No semantic awareness
Lakehouse-native vector search | Shares Parquet metadata, cache & RBAC | Requires deep engine integration

Enter e6data’s “Unify, Don’t Migrate” Motto

e6data’s philosophy—“Unify, Don’t Migrate”—embeds vector functions into the same optimizer that already handles column pruning and distributed scans. One planner, one security layer, zero copies.

Same table, same optimizer, same file reads—only now the planner injects a vector projection. The promise is seductive: no pipelines, zero duplication, federated permissions. The architectural “why” is simple: pushing vector search into the query engine keeps compute close to the data and shares the cache, scheduler, and security layers already battle-tested for petabyte-scale SQL scans. It turns a maintenance nightmare into a compiler problem.

Semantic Search Up Close

Image credits: Medium

Semantic search works by transforming unstructured data (images, documents, audio, video) into high-dimensional vector embeddings, i.e., numeric representations. How that transformation works depends on the source data.

Take a look at this image:

Translate “too expensive” into [0.815, 0.642, -0.233, …] and suddenly “price is a bit much,” “can’t afford this,” and “costs more than expected” land in the same neighborhood even though they share zero lexical tokens.

Vectors turn free text into geometry; cosine similarity measures angular distance regardless of magnitude.  
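
Here is a hedged sketch of that neighborhood effect, using the open-source sentence-transformers library (the model name is one common choice, not a requirement of the approach):

from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")  # emits 384-dim embeddings

phrases = [
    "too expensive",
    "price is a bit much",
    "can't afford this",
    "costs more than expected",
    "arrived two days early",   # unrelated control phrase
]
# Unit-length vectors make cosine similarity a plain dot product.
vecs = model.encode(phrases, normalize_embeddings=True)

query = vecs[0]
for phrase, vec in zip(phrases[1:], vecs[1:]):
    print(f"{phrase!r}: {float(np.dot(query, vec)):.3f}")
# The three pricing complaints score far above the control phrase,
# despite sharing zero lexical tokens with "too expensive".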

A Tiny Experiment

Open your favorite SQL prompt and try the most innocent query against an Amazon reviews export:

SELECT review_id, review_headline
FROM   reviews
WHERE  review_headline ILIKE '%too expensive%'
LIMIT  10;

It returns a handful of hits while missing the tens of thousands of paraphrases hiding behind different wording.

With Vector Search, the above query can be rewritten as:

SELECT review_id, review_headline
FROM   reviews
WHERE  cosine_distance(review_headline, 'too expensive') < 0.1
LIMIT  10;

This is a nearest-neighbor search on the review_headline column.

The query can now be extended with further SQL constructs, combining the best of SQL and vector operations, as sketched below.
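
For instance, here is a sketch of what such a hybrid query might look like, issued from Python over a generic DB-API connection. The star_rating and product_category columns, the threshold, and the exact cosine_distance signature are illustrative assumptions, not a documented API:

# Classic SQL constructs (scalar filter, GROUP BY, ORDER BY) combined
# with the vector predicate from the example above. Column names and
# the cosine_distance signature are assumptions for illustration.
HYBRID_QUERY = """
SELECT   product_category,
         COUNT(*) AS pricing_complaints
FROM     reviews
WHERE    star_rating <= 3                                         -- cheap scalar filter
  AND    cosine_distance(review_headline, 'too expensive') < 0.1  -- semantic filter
GROUP BY product_category
ORDER BY pricing_complaints DESC;
"""

def pricing_complaints_by_category(conn):
    # conn: any DB-API 2.0 connection to an engine that understands
    # the vector function used above.
    cur = conn.cursor()
    cur.execute(HYBRID_QUERY)
    return cur.fetchall()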

How Vector, Semantic, & Similarity Search Work

We’ve seen why keyword search buckles under the weight of customer slang, emojis, and multilingual nuances, and we’ve explored how silo-heavy architectures force teams into cost-intensive copy-and-paste gymnastics.

The takeaway is clear: to mine real insight from the 80% of data that hides in free-form text, we need an engine that speaks both SQL and semantics without spawning yet another datastore. That’s where vector, semantic, and similarity search step in. But how do they actually convert messy language into the precise geometry an optimizer can race through?

  • Embedding – Models such as SBERT or OpenAI’s text-embedding-3-large convert each document into a high-dimensional vector (typically hundreds to a few thousand dimensions).
  • ANN index – Structures like HNSW, ScaNN, and DiskANN accelerate Approximate Nearest Neighbor lookups; which index fits best depends on your dataset and vector store.
  • Query flow in e6data’s bottom-up SQL engine architecture (a sketch follows this list):
    • Run the SQL filter
    • Reduce the search space
    • Carry out the vector search on the reduced search space
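
Here is a minimal sketch of that three-step flow, with a brute-force scan standing in for a real ANN index and a row layout invented purely for illustration:

import numpy as np

def filtered_vector_search(rows, query_vec, max_star_rating=3, k=5):
    # Each row is assumed to be a dict with 'star_rating',
    # 'review_headline', and a unit-length 'embedding' (np.ndarray).
    # 1. Run the SQL-style filter first.
    survivors = [r for r in rows if r["star_rating"] <= max_star_rating]
    if not survivors:
        return []
    # 2. The filter has shrunk the search space, so...
    mat = np.stack([r["embedding"] for r in survivors])
    # 3. ...the vector search only touches surviving rows. Brute force
    #    here; a production engine would consult an ANN index (HNSW,
    #    ScaNN, DiskANN) instead of scanning.
    sims = mat @ query_vec                     # cosine sim for unit vectors
    top = np.argsort(-sims)[:k]
    return [(survivors[i]["review_headline"], float(sims[i])) for i in top]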

Conclusion

Unstructured data isn’t a side gig anymore; it’s the majority of your lakehouse. Vector search, semantic search, and similarity search turn those blobs into queryable geometry without abandoning SQL or spinning up another silo. In our next blog post, we’ll dive into embedding best practices, cosine similarity math, and writing first-class SQL over vectors.

FAQs

What is vector search in a lakehouse?
Vector search in a lakehouse uses high-dimensional embeddings to find semantically similar content within unstructured data like text, images, or audio. It enables querying by meaning rather than exact keywords, making unstructured data searchable without exporting it to separate vector databases.
How does semantic search differ from keyword search?
Keyword search matches exact terms, while semantic search retrieves results based on meaning. For instance, searching 'too expensive' with semantic search also surfaces phrases like 'can't afford this' or 'costs more than expected', which keyword search would miss.
What are the benefits of integrating vector search directly into the lakehouse?
Integrating vector search into the lakehouse reduces data movement, maintains consistent security and governance, and leverages existing infrastructure for both structured and unstructured data analysis, leading to faster insights and lower costs.
How does cosine similarity work in vector search?
Cosine similarity measures the angle between two vectors in a high-dimensional space. A smaller angle (closer to 0) indicates higher similarity, allowing the system to identify semantically related content.
