Vectorize documents with embeddings, implement cosine similarity search, and build a ranked results UI with performance optimization.
Traditional keyword search fails when users describe concepts rather than exact terms. Embeddings-based search solves this by converting text into high-dimensional vectors where semantic similarity maps to geometric proximity. A user searching for "how to handle errors in async code" finds documents about exception handling, promise rejection, and try-catch patterns — even if those documents never use the word "error."
In this tutorial, you will build a complete semantic search engine. You will generate embeddings from document text, store them efficiently, implement cosine similarity ranking, expose a search API, and build a frontend that displays ranked results with relevance scores. You will also tackle the performance challenges that arise when your corpus grows beyond a few thousand documents.
What you will build:

- A chunking and embedding pipeline for document text
- An in-memory vector store with cosine similarity ranking
- A REST API for search and ingestion
- A frontend that displays ranked results with relevance scores
- Performance optimizations (quantization and an IVF index) for large corpora
Before writing code, understand what you are building on. An embedding model converts text into a fixed-length array of floating-point numbers — typically 384 to 1536 dimensions. Texts with similar meaning produce vectors that point in similar directions.
Cosine similarity measures the angle between two vectors, ignoring magnitude:
A score of 1.0 means identical direction (semantically identical), 0.0 means orthogonal (unrelated), and -1.0 means opposite. In practice, most document pairs score between 0.1 and 0.5, so a score of 0.7 indicates strong relevance even though it sits well below the theoretical maximum.
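These score endpoints are easy to verify directly; a minimal NumPy sketch (the 2-D vectors are illustrative, and `cosine` is a throwaway helper defined here):

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    # After normalizing to unit length, the dot product is the cosine of the angle
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    return float(a @ b)

print(cosine(np.array([1.0, 0.0]), np.array([1.0, 0.0])))   # -> 1.0 (identical direction)
print(cosine(np.array([1.0, 0.0]), np.array([0.0, 1.0])))   # -> 0.0 (orthogonal)
print(cosine(np.array([1.0, 0.0]), np.array([-1.0, 0.0])))  # -> -1.0 (opposite)
```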
Raw documents are too long to embed as single vectors — models have token limits, and longer texts dilute meaning. Split documents into overlapping chunks that preserve context.
The overlap ensures that sentences split across chunk boundaries still appear intact in at least one chunk. A 512-word chunk with 64-word overlap is a practical default for most embedding models.
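With these defaults you can estimate the chunk count up front; a small sketch (`estimate_chunks` is a hypothetical helper introduced here, assuming a sliding window that advances by `chunk_size - overlap` words):

```python
import math

def estimate_chunks(n_words: int, chunk_size: int = 512, overlap: int = 64) -> int:
    # Each chunk after the first advances the window by (chunk_size - overlap) words,
    # so a sliding window over n_words produces ceil(n_words / step) chunks.
    step = chunk_size - overlap
    return max(1, math.ceil(n_words / step))

print(estimate_chunks(10_000))  # -> 23 chunks for a 10,000-word document
```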
Use the sentence-transformers library to generate embeddings locally. The all-MiniLM-L6-v2 model produces 384-dimensional vectors and runs efficiently on CPU.
Pre-normalizing embeddings is critical — when vectors are unit-length, cosine similarity reduces to a dot product, which is significantly faster to compute.
Build an in-memory vector store that holds embeddings as a NumPy matrix. Searching becomes a single matrix multiplication against the query vector.
For 100,000 chunks at 384 dimensions, the embedding matrix occupies roughly 150MB — comfortable for a single server. The matrix multiplication search completes in under 10ms.
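The 150MB figure follows directly from the array shape; a quick check (float32 is 4 bytes per value):

```python
n_chunks, dim, bytes_per_float32 = 100_000, 384, 4
total_mb = n_chunks * dim * bytes_per_float32 / 1e6
print(total_mb)  # -> 153.6 MB, i.e. "roughly 150MB"
```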
Expose the vector store through a REST API. A query arrives as text, gets embedded, and the ranked results come back as JSON.
Add an endpoint to ingest new documents. The pipeline chunks the text, generates embeddings, and adds them to the store — all in a single request.
For production, you would run embedding generation in a background task queue. For this tutorial, synchronous processing keeps the architecture simple.
Build a search interface that sends queries as the user types (debounced) and renders results with relevance scores.
When multiple chunks from the same document match, the results list looks redundant. Group results by document and show the highest-scoring chunk as the primary result.
For durability without a database, serialize the vector store to disk using NumPy's memory-mapped arrays. This allows the store to survive restarts and loads faster than re-embedding.
As your corpus grows, brute-force search becomes a bottleneck. Two key optimizations extend the system to millions of documents.
Scalar quantization reduces memory by 4x with minimal accuracy loss. Convert float32 embeddings to uint8 by scaling each dimension to the 0-255 range:
Inverted file index (IVF) partitions vectors into clusters using k-means. At query time, only the nearest clusters are searched:
With IVF and quantization combined, search over 1 million documents completes in under 50ms. For even larger corpora, consider dedicated vector databases like pgvector or Qdrant — but understanding these primitives is essential for making informed architecture decisions.
```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    dot_product = np.dot(a, b)
    norm_a = np.linalg.norm(a)
    norm_b = np.linalg.norm(b)
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return float(dot_product / (norm_a * norm_b))
```

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    doc_id: str
    chunk_index: int
    text: str
    embedding: list[float] | None = None

def chunk_document(doc_id: str, text: str, chunk_size: int = 512, overlap: int = 64) -> list[Chunk]:
    words = text.split()
    chunks: list[Chunk] = []
    start = 0
    index = 0
    while start < len(words):
        end = min(start + chunk_size, len(words))
        chunk_text = " ".join(words[start:end])
        chunks.append(Chunk(doc_id=doc_id, chunk_index=index, text=chunk_text))
        start += chunk_size - overlap
        index += 1
    return chunks
```

```python
from sentence_transformers import SentenceTransformer

class EmbeddingService:
    def __init__(self, model_name: str = "all-MiniLM-L6-v2"):
        self.model = SentenceTransformer(model_name)
        self.dimension = self.model.get_sentence_embedding_dimension()

    def embed_texts(self, texts: list[str]) -> np.ndarray:
        embeddings = self.model.encode(
            texts,
            batch_size=32,
            show_progress_bar=True,
            normalize_embeddings=True,  # Pre-normalize for faster cosine similarity
        )
        return np.array(embeddings, dtype=np.float32)

    def embed_query(self, query: str) -> np.ndarray:
        embedding = self.model.encode(
            [query],
            normalize_embeddings=True,
        )
        return np.array(embedding[0], dtype=np.float32)
```

```python
class VectorStore:
    def __init__(self):
        self.chunks: list[Chunk] = []
        self.embeddings: np.ndarray | None = None

    def add_chunks(self, chunks: list[Chunk], embeddings: np.ndarray) -> None:
        for chunk, emb in zip(chunks, embeddings):
            chunk.embedding = emb.tolist()
        self.chunks.extend(chunks)
        if self.embeddings is None:
            self.embeddings = embeddings
        else:
            self.embeddings = np.vstack([self.embeddings, embeddings])

    def search(self, query_embedding: np.ndarray, top_k: int = 10) -> list[tuple[Chunk, float]]:
        if self.embeddings is None or len(self.chunks) == 0:
            return []
        # With normalized vectors, cosine similarity = dot product
        scores = self.embeddings @ query_embedding
        top_indices = np.argsort(scores)[::-1][:top_k]
        results = []
        for idx in top_indices:
            results.append((self.chunks[idx], float(scores[idx])))
        return results

    def count(self) -> int:
        return len(self.chunks)
```

```python
from fastapi import FastAPI, Query
from pydantic import BaseModel

app = FastAPI()

# Module-level singletons shared by the endpoints below
embedding_service = EmbeddingService()
vector_store = VectorStore()

class SearchResult(BaseModel):
    doc_id: str
    chunk_index: int
    text: str
    score: float

@app.get("/api/search")
async def search(
    q: str = Query(..., min_length=1, max_length=500),
    top_k: int = Query(default=10, ge=1, le=100),
) -> list[SearchResult]:
    query_embedding = embedding_service.embed_query(q)
    results = vector_store.search(query_embedding, top_k=top_k)
    return [
        SearchResult(
            doc_id=chunk.doc_id,
            chunk_index=chunk.chunk_index,
            text=chunk.text,
            score=round(score, 4),
        )
        for chunk, score in results
    ]
```

```python
class IngestRequest(BaseModel):
    doc_id: str
    title: str
    content: str

@app.post("/api/ingest")
async def ingest(request: IngestRequest) -> dict:
    chunks = chunk_document(request.doc_id, request.content)
    texts = [c.text for c in chunks]
    embeddings = embedding_service.embed_texts(texts)
    vector_store.add_chunks(chunks, embeddings)
    return {
        "doc_id": request.doc_id,
        "chunks_created": len(chunks),
        "total_chunks": vector_store.count(),
    }
```

```typescript
// src/ui/search.ts
interface SearchResult {
  doc_id: string;
  chunk_index: number;
  text: string;
  score: number;
}

const searchInput = document.getElementById("search-input") as HTMLInputElement;
const resultsContainer = document.getElementById("results") as HTMLDivElement;

let debounceTimer: ReturnType<typeof setTimeout>;
searchInput.addEventListener("input", () => {
  clearTimeout(debounceTimer);
  debounceTimer = setTimeout(() => performSearch(searchInput.value), 300);
});

async function performSearch(query: string): Promise<void> {
  if (query.length < 2) {
    resultsContainer.innerHTML = "";
    return;
  }
  const response = await fetch(`/api/search?q=${encodeURIComponent(query)}&top_k=20`);
  const results: SearchResult[] = await response.json();
  resultsContainer.innerHTML = results
    .map(
      (r) => `
      <div class="result-card" style="border-left: 3px solid ${scoreToColor(r.score)}">
        <div class="result-header">
          <span class="doc-id">${escapeHtml(r.doc_id)}</span>
          <span class="score">${(r.score * 100).toFixed(1)}% match</span>
        </div>
        <p class="result-text">${highlightQuery(r.text, query)}</p>
      </div>
    `
    )
    .join("");
}

function scoreToColor(score: number): string {
  if (score > 0.7) return "#10b981";
  if (score > 0.4) return "#f59e0b";
  return "#6b7280";
}

// Escape untrusted text before injecting it into innerHTML
function escapeHtml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Minimal highlighter: escape the chunk text, then wrap each query word in <mark>
function highlightQuery(text: string, query: string): string {
  let escaped = escapeHtml(text);
  for (const word of query.split(/\s+/).filter((w) => w.length > 1)) {
    const pattern = new RegExp(word.replace(/[.*+?^${}()|[\]\\]/g, "\\$&"), "gi");
    escaped = escaped.replace(pattern, (match) => `<mark>${match}</mark>`);
  }
  return escaped;
}
```

```python
from itertools import groupby

def deduplicate_results(
    results: list[tuple[Chunk, float]], max_per_doc: int = 2
) -> list[tuple[Chunk, float]]:
    sorted_results = sorted(results, key=lambda r: r[0].doc_id)
    deduped: list[tuple[Chunk, float]] = []
    for _doc_id, group in groupby(sorted_results, key=lambda r: r[0].doc_id):
        doc_results = sorted(group, key=lambda r: r[1], reverse=True)
        deduped.extend(doc_results[:max_per_doc])
    deduped.sort(key=lambda r: r[1], reverse=True)
    return deduped
```

```python
import json
import os

class PersistentVectorStore(VectorStore):
    def __init__(self, storage_dir: str):
        super().__init__()
        self.storage_dir = storage_dir
        os.makedirs(storage_dir, exist_ok=True)

    def save(self) -> None:
        if self.embeddings is not None:
            np.save(os.path.join(self.storage_dir, "embeddings.npy"), self.embeddings)
        chunks_data = [
            {"doc_id": c.doc_id, "chunk_index": c.chunk_index, "text": c.text}
            for c in self.chunks
        ]
        with open(os.path.join(self.storage_dir, "chunks.json"), "w") as f:
            json.dump(chunks_data, f)

    def load(self) -> None:
        emb_path = os.path.join(self.storage_dir, "embeddings.npy")
        chunks_path = os.path.join(self.storage_dir, "chunks.json")
        if os.path.exists(emb_path) and os.path.exists(chunks_path):
            # Memory-map the embeddings so startup doesn't copy the whole matrix
            self.embeddings = np.load(emb_path, mmap_mode="r")
            with open(chunks_path) as f:
                data = json.load(f)
            self.chunks = [Chunk(**d) for d in data]
```

```python
def quantize(embeddings: np.ndarray) -> tuple[np.ndarray, np.ndarray, np.ndarray]:
    mins = embeddings.min(axis=0)
    maxs = embeddings.max(axis=0)
    ranges = maxs - mins
    ranges[ranges == 0] = 1.0  # Avoid division by zero for constant dimensions
    scaled = ((embeddings - mins) / ranges * 255).astype(np.uint8)
    return scaled, mins, ranges

def dequantize(quantized: np.ndarray, mins: np.ndarray, ranges: np.ndarray) -> np.ndarray:
    return (quantized.astype(np.float32) / 255.0) * ranges + mins
```

```python
from sklearn.cluster import MiniBatchKMeans

class IVFIndex:
    def __init__(self, n_clusters: int = 64, n_probe: int = 8):
        self.n_clusters = n_clusters
        self.n_probe = n_probe
        self.kmeans: MiniBatchKMeans | None = None
        self.inverted_lists: dict[int, list[int]] = {}

    def build(self, embeddings: np.ndarray) -> None:
        self.kmeans = MiniBatchKMeans(n_clusters=self.n_clusters, batch_size=256)
        labels = self.kmeans.fit_predict(embeddings)
        self.inverted_lists = {}
        for idx, label in enumerate(labels):
            self.inverted_lists.setdefault(int(label), []).append(idx)

    def search(self, query: np.ndarray, embeddings: np.ndarray, top_k: int) -> list[int]:
        if self.kmeans is None:
            raise RuntimeError("Call build() before search()")
        # Distance from the query to each cluster centroid
        distances = self.kmeans.transform(query.reshape(1, -1))[0]
        nearest_clusters = np.argsort(distances)[: self.n_probe]
        candidate_indices = []
        for c in nearest_clusters:
            candidate_indices.extend(self.inverted_lists.get(int(c), []))
        candidates = embeddings[candidate_indices]
        scores = candidates @ query
        top_local = np.argsort(scores)[::-1][:top_k]
        return [candidate_indices[i] for i in top_local]
```
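As a sanity check that ranking survives quantization, here is a self-contained sketch on random data (the per-dimension min/range scaling mirrors `quantize`/`dequantize` above; the seed, sizes, and choice of row 42 as the query are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 384)).astype(np.float32)
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)

# Scalar quantization: scale each dimension into 0-255 and store as uint8
mins = embeddings.min(axis=0)
ranges = embeddings.max(axis=0) - mins  # no zero ranges with continuous random data
quantized = ((embeddings - mins) / ranges * 255).astype(np.uint8)

# Query with a stored vector: after dequantizing, it should still rank itself first
query = embeddings[42]
restored = (quantized.astype(np.float32) / 255.0) * ranges + mins
scores = restored @ query
assert int(np.argmax(scores)) == 42
```

The quantization error per dimension is at most the step size (`range / 255`), which is far smaller than the gap between a vector's similarity to itself and to unrelated documents, so top results are stable in practice.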