Engineering

Vector & Semantic Search in the Lakehouse: Faster Insight from Unstructured Data

June 11, 2025 / Adishesh Kishore, Srikanth Venugopalan

Vector Search, Similarity Search, and Semantic Search in the Lakehouse


By 2025, the world will generate 175 zettabytes of data, roughly 80% of it unstructured; yet 90% of that trove is never analyzed (Forbes Tech Council, 2022). Conventional SQL predicates (LIKE '%price%') were designed for rows and columns, not emojis, paraphrases, or multilingual nuance. Vector search, semantic search, and similarity search provide the first scalable way, via high-dimensional vector embeddings, to make these dark corners of the lakehouse searchable without exporting petabytes to a specialist silo.

From Neat Tables to Messy Reality

Early “data warehouse” stories read like urban planning fables: define the schema, enforce contracts, rinse, repeat. Then reality intruded. Modern ingest tiers happily dump semi-structured or fully opaque blobs into S3, Delta, and Iceberg; pick your poison.

A reminder that unstructured data is outgrowing structured data

Text fields become data swamps. A field labeled ‘Customer Notes’, for example, says nothing about what those notes contain: there are no semantics attached to it, just blobs of text. Dashboards built on top simply ignore the sentiment, intent, and sarcasm hidden inside.

Text fields are easy to store in SQL but hard to analyze, because the engine has no built-in semantic understanding. Traditional predicates (LIKE '%price%') miss paraphrases, negations, emojis, and multilingual synonyms. The result is an analytics blind spot that grows in lockstep with your customers’ voices.

Why Keyword Search Isn’t Enough

Human language relies on meaning, not token coincidence. The image below compares Keyword Search and Similarity Search: the former highlights identical strings, while the latter clusters semantically proximate phrases, even when their stems diverge.

Keyword match = token coincidence; human understanding = semantic proximity.

semantic proximity ∝ 1 − cos θ

Keyword search matches only the exact term, while similarity (vector) search matches semantic intent

Type ‘enterprise plan’ into a keyword search and it looks for exactly those characters. But an enterprise plan could mean a premium tier, guaranteed uptime, or white-glove support, none of which a literal textual lookup will surface. Semantic search, in contrast, retrieves by meaning: look up ‘enterprise plan’ and you land on results covering the premium tier, guaranteed uptime, and white-glove support, not just the literal words.

Example: Think of a support scenario. An agent tries to find “refund frustration” tickets, but the keyword engine surfaces only literal matches. In contrast, a vector engine bubbles up “product didn’t meet expectations,” “want my money back,” and a dozen more paraphrases that the agent would otherwise miss.

Query | Keyword Search Finds | Similarity Search Also Surfaces
“enterprise plan” | Strings containing “enterprise plan” | “premium tier”, “guaranteed uptime”, “white-glove support”
“refund frustration” | Literal phrase only | “want my money back”, “didn’t meet expectations”


Vector embeddings map every phrase into a high-dimensional space; cosine similarity then retrieves nearest neighbors regardless of wording, each vector being a numeric representation of linguistic context.
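
To make “angular distance” concrete: for two vectors a and b, cosine similarity is a · b / (‖a‖ ‖b‖). Here is a minimal Python sketch with toy three-dimensional vectors standing in for real embeddings:

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine of the angle between two vectors; 1.0 means "same direction".
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dimensional "embeddings"; real models emit hundreds of dimensions.
enterprise_plan = np.array([0.9, 0.1, 0.3])
premium_tier    = np.array([0.8, 0.2, 0.35])   # semantically close phrase
toaster_review  = np.array([-0.2, 0.9, -0.5])  # unrelated phrase

print(cosine_similarity(enterprise_plan, premium_tier))   # ~0.99: near neighbors
print(cosine_similarity(enterprise_plan, toaster_review)) # negative: far apart

The absolute numbers are not meaningful on their own; what matters is the ordering, which is exactly what nearest-neighbor retrieval relies on.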

The Lakehouse Dilemma

Analytical databases excel at predicate push-down, column pruning, vectorized execution—on structured columns. Vector databases bloom on the side, optimized for ANN indices but alien to OLAP joins.

Operational DBs hold a slice, ETL copies into warehouses, specialized vector DBs fork yet another silo. Each hop costs latency, dollars, and governance headaches.

Option | Strength | Pain
Standalone vector DB | Millisecond ANN search | New cluster, new governance, ETL hop
Traditional OLAP | Joins, predicate push-down | No semantic awareness
Lakehouse-native vector search | Shares Parquet metadata, cache & RBAC | Requires deep engine integration

Enter e6data’s “Unify, Don’t Migrate” Motto

e6data’s philosophy—“Unify, Don’t Migrate”—embeds vector functions into the same optimizer that already handles column pruning and distributed scans. One planner, one security layer, zero copies.

Same table, same optimizer, same file reads—only now the planner injects a vector projection. The promise is seductive: no pipelines, zero duplication, federated permissions. The architectural “why” is simple: pushing vector search into the query engine keeps compute close to the data and shares the cache, scheduler, and security layers already battle-tested for petabyte-scale SQL scans. It turns a maintenance nightmare into a compiler problem.

Semantic Search Up Close

Image credits: Medium

Semantic search works by transforming unstructured data (images, documents, audio, video) into high-dimensional vector embeddings, i.e., numeric representations. How that transformation works depends on the source data.

Take a look at this image:

Translate “too expensive” into [0.815, 0.642, -0.233, …] and suddenly “price is a bit much,” “can’t afford this,” and “costs more than expected” land in the same neighborhood even though they share zero lexical tokens.

Vectors turn free text into geometry; cosine similarity measures angular distance regardless of magnitude.  
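
Here is a hedged sketch of that neighborhood effect, using the open-source sentence-transformers library (the model name is one common choice, not a requirement of the approach):

from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")  # emits 384-dim embeddings

phrases = [
    "too expensive",
    "price is a bit much",
    "can't afford this",
    "costs more than expected",
    "arrived two days early",   # unrelated control phrase
]
# Unit-length vectors make cosine similarity a plain dot product.
vecs = model.encode(phrases, normalize_embeddings=True)

query = vecs[0]
for phrase, vec in zip(phrases[1:], vecs[1:]):
    print(f"{phrase!r}: {float(np.dot(query, vec)):.3f}")
# The three pricing complaints score far above the control phrase,
# despite sharing zero lexical tokens with "too expensive".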

A Tiny Experiment

Open your favorite SQL prompt and try the most innocent query against an Amazon reviews export:

SELECT review_id, review_headline
FROM   reviews
WHERE  review_headline ILIKE '%too expensive%'
LIMIT  10;

It returns a handful of hits while missing the tens of thousands of paraphrases hiding behind different wording.

With Vector Search, the above query can be rewritten as:

SELECT review_id, review_headline
FROM   reviews
WHERE  cosine_distance(review_headline, 'too expensive') < 0.1
LIMIT  10;

This is a nearest-neighbor search on the review_headline column.

The query can now be extended with further SQL constructs, combining the best of SQL and vector operations, as sketched below.
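
For instance, here is a sketch of what such a hybrid query might look like, issued from Python over a generic DB-API connection. The star_rating and product_category columns, the threshold, and the exact cosine_distance signature are illustrative assumptions, not a documented API:

# Classic SQL constructs (scalar filter, GROUP BY, ORDER BY) combined
# with the vector predicate from the example above. Column names and
# the cosine_distance signature are assumptions for illustration.
HYBRID_QUERY = """
SELECT   product_category,
         COUNT(*) AS pricing_complaints
FROM     reviews
WHERE    star_rating <= 3                                         -- cheap scalar filter
  AND    cosine_distance(review_headline, 'too expensive') < 0.1  -- semantic filter
GROUP BY product_category
ORDER BY pricing_complaints DESC;
"""

def pricing_complaints_by_category(conn):
    # conn: any DB-API 2.0 connection to an engine that understands
    # the vector function used above.
    cur = conn.cursor()
    cur.execute(HYBRID_QUERY)
    return cur.fetchall()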

How Vector, Semantic, & Similarity Search Work

We’ve seen why keyword search buckles under the weight of customer slang, emojis, and multilingual nuances, and we’ve explored how silo-heavy architectures force teams into cost-intensive copy-and-paste gymnastics.

The takeaway is clear: to mine real insight from the 80% of data that hides in free-form text, we need an engine that speaks both SQL and semantics without spawning yet another datastore. That’s where vector, semantic, and similarity search step in. But how do they actually convert messy language into the precise geometry an optimizer can race through?

  • Embedding – Models such as SBERT or OpenAI’s text-embedding-3-large convert each document into a high-dimensional vector (typically hundreds to a few thousand dimensions).
  • ANN index – Structures like HNSW, ScaNN, and DiskANN accelerate Approximate Nearest Neighbor lookups; which index fits best depends on your dataset and vector store.
  • Query flow in e6data’s bottom-up SQL engine architecture (a sketch follows this list):
    • Run the SQL filter
    • Reduce the search space
    • Carry out the vector search on the reduced search space
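
Here is a minimal sketch of that three-step flow, with a brute-force scan standing in for a real ANN index and a row layout invented purely for illustration:

import numpy as np

def filtered_vector_search(rows, query_vec, max_star_rating=3, k=5):
    # Each row is assumed to be a dict with 'star_rating',
    # 'review_headline', and a unit-length 'embedding' (np.ndarray).
    # 1. Run the SQL-style filter first.
    survivors = [r for r in rows if r["star_rating"] <= max_star_rating]
    if not survivors:
        return []
    # 2. The filter has shrunk the search space, so...
    mat = np.stack([r["embedding"] for r in survivors])
    # 3. ...the vector search only touches surviving rows. Brute force
    #    here; a production engine would consult an ANN index (HNSW,
    #    ScaNN, DiskANN) instead of scanning.
    sims = mat @ query_vec                     # cosine sim for unit vectors
    top = np.argsort(-sims)[:k]
    return [(survivors[i]["review_headline"], float(sims[i])) for i in top]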

Conclusion

Unstructured data isn’t a side gig anymore; it’s the majority of your lakehouse. Vector search, semantic search, and similarity search turn those blobs into queryable geometry without abandoning SQL or spinning up another silo. In our next blog post, we’ll dive into embedding best practices, cosine similarity math, and writing first-class SQL over vectors.

FAQs

What is vector search in a lakehouse?
Vector search in a lakehouse uses high-dimensional embeddings to find semantically similar content within unstructured data like text, images, or audio. It enables querying by meaning rather than exact keywords, making unstructured data searchable without exporting it to separate vector databases.
How does semantic search differ from keyword search?
Keyword search matches exact terms, while semantic search retrieves results based on meaning. For instance, searching 'too expensive' with semantic search also surfaces phrases like 'can't afford this' or 'costs more than expected', which keyword search would miss.
What are the benefits of integrating vector search directly into the lakehouse?
Integrating vector search into the lakehouse reduces data movement, maintains consistent security and governance, and leverages existing infrastructure for both structured and unstructured data analysis, leading to faster insights and lower costs.
How does cosine similarity work in vector search?
Cosine similarity measures the angle between two vectors in a high-dimensional space. A smaller angle (closer to 0) indicates higher similarity, allowing the system to identify semantically related content.
