By 2025, the world will generate 175 zettabytes of data, and ≈80% of it will be unstructured; yet 90% of that trove is never analyzed (Forbes Tech Council, 2022). Conventional SQL predicates (LIKE '%price%') were designed for rows and columns, not emojis, paraphrases, or multilingual nuance. Vector search, semantic search, and similarity search offer the first scalable way, via high-dimensional vector embeddings, to make these dark corners of the lakehouse searchable without exporting petabytes to a specialist silo.
Early “data warehouse” stories read like urban planning fables: define schema, enforce contracts, rinse, repeat. Then reality intruded. Modern ingest tiers happily dump semi-structured or fully opaque blobs into S3, Delta, or Iceberg: pick your poison.
Text fields become data swamps. A column labeled ‘Customer Notes’, for example, says nothing about what those notes actually contain; there are no semantics attached to it, just blobs of text. Dashboards built on top simply ignore the sentiment, intent, and sarcasm hidden inside.
Text fields are stored in SQL, but they are hard to analyze because the engine has no built-in semantic understanding. Traditional predicates (LIKE '%price%') miss paraphrases, negations, emojis, and multilingual synonyms. The result is an analytics blind spot that grows in lockstep with your customers’ voices.
Human language relies on meaning, not token coincidence. The image below compares Keyword Search and Similarity Search: the former highlights identical strings, while the latter clusters semantically proximate phrases, even when their stems diverge.
Keyword match = token coincidence; human understanding = semantic proximity.
semantic proximity ∝ cos θ (cosine similarity); cosine distance = 1 − cos θ
Type ‘enterprise plan’ into a keyword search and it looks for those literal characters. But an enterprise plan could mean a premium tier, guaranteed uptime, or white-glove support, none of which a textual lookup will surface. Semantic search, in contrast, lands you on the results that describe the premium tier, guaranteed uptime, and white-glove support. It searches by meaning, not just words.
Example: Think of a support scenario. An agent tries to find “refund frustration” tickets, but the keyword engine surfaces only literal matches. In contrast, a vector engine bubbles up “product didn’t meet expectations,” “want my money back,” and a dozen more paraphrases that the agent would otherwise miss.
Vector embeddings map every phrase into a high-dimensional space, each vector a numeric representation of its linguistic context; cosine similarity then retrieves nearest neighbors regardless of wording.
Analytical databases excel at predicate push-down, column pruning, vectorized execution—on structured columns. Vector databases bloom on the side, optimized for ANN indices but alien to OLAP joins.
Operational DBs hold a slice, ETL copies into warehouses, specialized vector DBs fork yet another silo. Each hop costs latency, dollars, and governance headaches.
e6data’s philosophy—“Unify, Don’t Migrate”—embeds vector functions into the same optimizer that already handles column pruning and distributed scans. One planner, one security layer, zero copies.
Same table, same optimizer, same file reads—only now the planner injects a vector projection. The promise is seductive: no pipelines, zero duplication, federated permissions. The architectural “why” is simple: pushing vector search into the query engine keeps compute close to the data, sharing the cache, scheduler, and security layers already battle-tested for petabyte SQL scans. It turns a maintenance nightmare into a compiler problem.
Semantic search works by transforming unstructured data (images, documents, audio, video) into high-dimensional vector embeddings, i.e., numeric representations. But how that transformation works depends on the source data.
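As a minimal sketch of the text case in SQL, imagine precomputing an embedding next to the source column; embed_text here is a hypothetical UDF standing in for whatever embedding model call your platform provides, not a documented e6data function:

-- Hypothetical sketch: precompute one embedding per row.
-- embed_text() is a stand-in for your platform's embedding UDF or an
-- external model-serving call; the name is illustrative only.
CREATE TABLE reviews_with_vectors AS
SELECT review_id,
       review_headline,
       embed_text(review_headline) AS headline_vector  -- high-dimensional float array
FROM reviews;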
Take a look at this image:
Translate “too expensive” into [0.815, 0.642, -0.233, …] and suddenly “price is a bit much,” “can’t afford this,” and “costs more than expected” land in the same neighborhood even though they share zero lexical tokens.
Vectors turn free text into geometry; cosine similarity measures angular distance regardless of magnitude.
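A toy two-dimensional example makes the magnitude point concrete (real embeddings run to hundreds or thousands of dimensions):

A = [3, 4], B = [6, 8]
A · B = 3×6 + 4×8 = 50
‖A‖ = √(3² + 4²) = 5, ‖B‖ = √(6² + 8²) = 10
cos θ = 50 / (5 × 10) = 1.0, so cosine distance = 1 − 1.0 = 0

B is twice as long as A, yet the angle between them is zero: direction, not magnitude, carries the meaning.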
Open your favorite SQL prompt and try the most innocent query against an Amazon reviews export:
SELECT review_id, review_headline
FROM reviews
WHERE review_headline ILIKE '%too expensive%'
LIMIT 10;
It returns a handful of hits but misses the tens of thousands of paraphrases hiding behind different wording.
With Vector Search, the above query can be rewritten as:
SELECT review_id, review_headline
FROM reviews
WHERE cosine_distance(review_headline, 'too expensive') < 0.1
LIMIT 10;
This performs a nearest-neighbour search on the review_headline column: it returns every row whose headline sits within a cosine distance of 0.1 of the query phrase.
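The threshold form returns every match within a fixed radius; for a classic top-k nearest-neighbour query, order by the distance instead (same cosine_distance function, otherwise plain SQL):

-- Top-10 closest headlines, however far away they are.
SELECT review_id, review_headline
FROM reviews
ORDER BY cosine_distance(review_headline, 'too expensive')
LIMIT 10;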
These queries can be extended with further SQL constructs, combining the best of SQL and vector operations, as in the sketch below.
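For instance, here is one sketch that mixes a semantic predicate with ordinary filters and aggregation; star_rating and product_category are assumed columns from the public Amazon reviews schema, so adjust the names to your table:

-- Assumed columns: star_rating, product_category (public Amazon reviews schema).
SELECT product_category,
       COUNT(*) AS pricing_complaints,
       AVG(star_rating) AS avg_star_rating
FROM reviews
WHERE cosine_distance(review_headline, 'too expensive') < 0.1
  AND star_rating <= 3
GROUP BY product_category
ORDER BY pricing_complaints DESC
LIMIT 10;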
We’ve seen why keyword search buckles under the weight of customer slang, emojis, and multilingual nuances, and we’ve explored how silo-heavy architectures force teams into cost-intensive copy-and-paste gymnastics.
The takeaway is clear: to mine real insight from the 80% of data that hides in free-form text, we need an engine that speaks both SQL and semantics without spawning yet another datastore. That’s where vector, semantic, and similarity search step in. But how do they actually convert messy language into precise geometry that the optimizer can race through?
Unstructured data isn’t a side gig anymore; it’s the majority of your lakehouse. Vector search, semantic search, and similarity search turn those blobs into queryable geometry without abandoning SQL or spinning up another silo. In our next blog post, we’ll dive into embedding best practices, cosine similarity math, and writing first-class SQL over vectors.