In the last post of our Vector Search Blog Series, Vector Search in the Lakehouse: Unlocking Unstructured Data, we introduced how text embeddings can turbocharge search by capturing semantic meaning, allowing SQL queries to find results by concept rather than exact keywords. Now, let’s dive deeper into the practical details behind that approach. We’ll go through some SQL examples using cosine similarity, compare a few popular embedding models (MiniLM, SciBERT, CodeBERT) and their characteristics, outline how e6data uses LOAD VECTOR
to generate embeddings, and share guidance on choosing cosine similarity thresholds for semantic search.
One of the biggest benefits of vector embeddings is the ability to perform similarity search in SQL using cosine similarity (or distance) metrics. The earlier blog in the series had a simple example using an Amazon product reviews dataset. Let’s use the same example again. Suppose we want to find reviews with headlines semantically similar to the phrase “amazingly good.”
Before (Keyword Search): Originally, we might have used a keyword filter like this:
SELECT review_id, review_headline
FROM reviews
WHERE review_headline ILIKE '%amazingly good%';
This would return only rows where the exact words “amazingly good” appear in the headline, missing many phrasing variations. In fact, running a query like the one above fetches only a handful of hits and misses tens of thousands of paraphrases expressed in different words. If keyword search is backed by a text-indexing solution such as a BM25 index, users can also match synonyms or partial phrases, but the matching is still lexical rather than semantic.
After (Vector Similarity Search): e6data keeps vector embeddings transparent to the search operation, so the end user writes the query against the column as if filtering on the raw text, while e6data recognizes it as a vector search and delegates the lookup to the underlying embeddings automatically. At runtime, the engine derives an embedding for review_headline on demand, so the SQL stays focused on the original column while the vector math happens underneath. A query to retrieve reviews semantically close to “amazingly good” therefore looks like this:
SELECT
review_id,
review_headline,
cosine_distance(review_headline, 'amazingly good') AS cos_dist
FROM reviews
ORDER BY cos_dist ASC
LIMIT 10;
Here, cosine_distance(review_headline, 'amazingly good')
computes the cosine distance between the review’s headline embedding and the embedding of the phrase “amazingly good”. The results are ordered by cos_dist
ascending (i.e., most similar first). The query performs a nearest-neighbor search in embedding space, surfacing semantically related reviews even when they don’t share the literal words, for example, “absolutely fantastic purchase” or “really good quality.” Any headline conveying a sentiment similar to “amazingly good” moves to the top.
Comparing the before and after queries makes the advantage pretty clear. The “before” version depended on literal string matching; the “after” taps directly into the embedding space while preserving the familiar column names. We get significantly improved recall of relevant results without changing our schema or query style.
Not all embeddings are created equal. Let’s compare three pre-trained models for generating vectors, highlighting their dimensions, speed (latency), and domain suitability:
With a model selected (MiniLM for its speed, SciBERT for scientific nuance, or CodeBERT when we’re embedding code), the next step is to turn raw text into vectors inside the lakehouse. That hand-off from “which model” to “how we generate and store” is handled by this command:
LOAD VECTOR INTO reviews(review_headline);
When this statement runs, the engine performs a parallel scan of reviews. For each headline, it calls the text-embedding service you chose. The command never writes an _embedding column back to the table; instead, it emits vector files in a purpose-built storage format we call Blitz.
Blitz stores vectors contiguously, with an offset table that supports O(1) lookup. During ingestion, we also apply scalar quantization, where 8-bit integers replace 32-bit floats, shrinking each 768-dimensional vector from roughly 3 KB to about 0.75 KB. Quantization is lossy, but it preserves enough angular information for a coarse search and dramatically improves cache locality.
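For intuition, here is a minimal sketch in plain SQL of a min-max scalar quantizer (the exact scheme Blitz uses isn’t spelled out here, so treat the formula and the v_min/v_max bounds as illustrative): each 32-bit float component maps to an 8-bit code in [0, 255], which is why a 768-dimensional vector shrinks from 768 × 4 bytes (≈ 3 KB) to 768 × 1 byte (≈ 0.75 KB).

-- Illustrative only: map one float component x in [v_min, v_max] to an 8-bit code.
-- v_min and v_max are hypothetical per-vector bounds, not part of any e6data API.
SELECT ROUND((x - v_min) / (v_max - v_min) * 255) AS int8_code
FROM (SELECT 0.42 AS x, -1.0 AS v_min, 1.0 AS v_max) AS sample_component;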
After vectors are on disk, every similarity query follows a two-pass plan:
1. Pass 1 (coarse): scan the quantized vectors to compute approximate cosine distances and collect a small candidate set.
2. Pass 2 (refine): re-score only that candidate set against the full-precision vectors to produce the final ranking.
Because quantized data are an eighth the size of the originals, Pass 1 is memory and I/O-friendly. Pass 2 touches only a tiny fraction of vectors, so the cost of full precision is negligible. The overall effect is sub-second search over millions of rows without maintaining a separate vector database; the vectors live in the lakehouse, managed by the same transactional metadata as the base table.
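To make the plan concrete, it behaves roughly like the nested query below. Here approx_cosine_distance is a hypothetical stand-in for the internal Pass 1 scoring over quantized vectors (it is not a user-facing function), and the candidate-set size is illustrative.

SELECT
  review_id,
  review_headline,
  cosine_distance(review_headline, 'amazingly good') AS exact_dist    -- Pass 2: full-precision re-ranking
FROM (
  SELECT review_id, review_headline
  FROM reviews
  ORDER BY approx_cosine_distance(review_headline, 'amazingly good')  -- Pass 1: coarse scan over quantized vectors
  LIMIT 1000                                                          -- small candidate set
) AS candidates
ORDER BY exact_dist ASC
LIMIT 10;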
One frequently asked question is: what cosine similarity (or distance) threshold should I use for filtering results? We used a threshold in the WHERE clause (e.g., cosine_distance(...) < 0.1) to return only close matches. Picking this threshold required a bit of tuning and some understanding of the embedding space.
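Building on the earlier query against reviews, the thresholded version looks something like this (0.1 is just a starting point; more on choosing it below):

SELECT
  review_id,
  review_headline,
  cosine_distance(review_headline, 'amazingly good') AS cos_dist
FROM reviews
WHERE cosine_distance(review_headline, 'amazingly good') < 0.1  -- keep only close matches
ORDER BY cos_dist ASC;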
Here’s what we observed when tightening the threshold from 0.2 to 0.1:
The “sweet spot” depends on your application’s needs. To find it, we suggest visualizing the distribution of cosine similarities between your query (or a set of example queries) and a large sample of your data. In practice, we took a random sample of pairwise cosine distances and plotted a histogram. What we saw was a roughly bimodal, skewed distribution: many pairs were dissimilar (distance closer to 1), and a smaller cluster was highly similar (distance near 0). That shape helps inform a threshold: you may see a “gap” or elbow in the distribution where similar and dissimilar items separate.
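If you prefer to stay in SQL rather than export distances to a plotting tool, a rough histogram can be computed by bucketing the distances. The sketch below bins headline-to-query distances into 0.05-wide buckets; the fully pairwise version would self-join the table, which we skip here for brevity.

SELECT
  FLOOR(cosine_distance(review_headline, 'amazingly good') * 20) / 20 AS dist_bucket,  -- 0.05-wide bins
  COUNT(*) AS num_headlines
FROM reviews
GROUP BY 1
ORDER BY 1;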
We also validated against known pairs to choose the threshold. In our case, we had some known duplicate or equivalent phrases in the data. For example, we knew “didn’t meet expectations” should match “want my money back” (both indicating refund requests), whereas “arrived late” should not match “not worth the price.” We checked the cosine distances for these known examples: the similar pair had a distance around 0.05, and the unrelated pair was around 0.3. This gave confidence that setting a cutoff around 0.1 or 0.15 would include the good pair and exclude the bad one. Essentially, we confirmed the threshold with a few examples.
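If cosine_distance accepts two text literals (an assumption on our part; it may be limited to a vector-backed column plus a literal), that spot check reduces to a one-line query:

-- Assumes cosine_distance works on two string literals; the ≈ values are what we observed.
SELECT
  cosine_distance('didn''t meet expectations', 'want my money back') AS similar_pair,    -- ≈ 0.05
  cosine_distance('arrived late', 'not worth the price')             AS unrelated_pair;  -- ≈ 0.3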
Our guidance on thresholds is: start with something somewhat strict (like 0.1 for distance, i.e., 90% similarity) and then adjust if you feel you’re missing too many results. If you retrieve too few hits, loosen it (0.2 or even higher). If you see off-target results, tighten it. The distribution plot and sample pairs method can greatly accelerate this tuning by giving you a sense of typical distances in your dataset. Every model and dataset will behave a bit differently (for instance, a model might generally produce tighter clusters vs. spread-out embeddings), so there’s no one-size-fits-all number. Threshold tuning is an iterative process. It’s worth spending time on because it directly affects the quality of your semantic search results.
Finally, remember that you don’t always need a hard cutoff. Sometimes, you might sort by similarity and just take the top K results instead of thresholding absolute values. We did this in some of our demo queries (just ordering by cos_dist
and limiting to 10 or 50 results). This way, you always get some results back, ranked by relevance. A threshold is more useful when you want to say “only show very similar matches or nothing.” In a production setting, we might use a combination: e.g., take the top 50 but also drop anything with a distance above 0.3 as a sanity filter.
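In SQL, that combination is simply the two clauses together; the 0.3 cutoff and top-50 limit below are the example values from above.

SELECT
  review_id,
  review_headline,
  cosine_distance(review_headline, 'amazingly good') AS cos_dist
FROM reviews
WHERE cosine_distance(review_headline, 'amazingly good') < 0.3  -- sanity filter on distance
ORDER BY cos_dist ASC                                           -- rank by relevance
LIMIT 50;                                                       -- top-K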
By enriching SQL with vector search capabilities, we bridge the gap between unstructured text and structured queries. In this deep dive, we expanded on how exactly we implement and fine-tune such a system: from writing SQL queries using cosine_distance
for semantic matching, to weighing different embedding models for quality vs. speed, to building a robust pipeline that generates and stores embeddings in our lakehouse, and finally to tuning similarity thresholds that yield relevant results. All these details come directly from our project’s journey, and we hope they help you in your own implementation.
The key takeaway is that working with embeddings in SQL is not only feasible, it’s incredibly powerful. With the right model and careful engineering, you can unlock new insights from text data within your familiar data warehouse or lakehouse environment. And as always, the journey is iterative: test, visualize, and tune as you integrate vector search into your applications.