Vectorize documents with embeddings, implement cosine similarity search, and build a ranked results UI with performance optimization.
Traditional keyword search fails when users describe concepts rather than exact terms. Embeddings-based search solves this by converting text into high-dimensional vectors where semantic similarity maps to geometric proximity. A user searching for "how to handle errors in async code" finds documents about exception handling, promise rejection, and try-catch patterns — even if those documents never use the word "error."
In this tutorial, you will build a complete semantic search engine. You will generate embeddings from document text, store them efficiently, implement cosine similarity ranking, expose a search API, and build a frontend that displays ranked results with relevance scores. You will also tackle the performance challenges that arise when your corpus grows beyond a few thousand documents.
What you will build:

- A chunking and embedding pipeline for document text
- An in-memory vector store with cosine similarity ranking
- A REST API for search and ingestion
- A frontend that displays ranked results with relevance scores
- Performance optimizations (quantization and an IVF index) for large corpora
Before writing code, understand what you are building on. An embedding model converts text into a fixed-length array of floating-point numbers — typically 384 to 1536 dimensions. Texts with similar meaning produce vectors that point in similar directions.
Cosine similarity measures the angle between two vectors, ignoring magnitude:
A score of 1.0 means identical direction (semantically identical), 0.0 means orthogonal (unrelated), and -1.0 means opposite. In practice, most document pairs score between 0.1 and 0.5, so a score of 0.7 indicates strong relevance even though it sits well below the theoretical maximum.
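These score endpoints are easy to verify directly; a minimal NumPy sketch (the 2-D vectors are illustrative, and `cosine` is a throwaway helper defined here):

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    # After normalizing to unit length, the dot product is the cosine of the angle
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    return float(a @ b)

print(cosine(np.array([1.0, 0.0]), np.array([1.0, 0.0])))   # -> 1.0 (identical direction)
print(cosine(np.array([1.0, 0.0]), np.array([0.0, 1.0])))   # -> 0.0 (orthogonal)
print(cosine(np.array([1.0, 0.0]), np.array([-1.0, 0.0])))  # -> -1.0 (opposite)
```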
Raw documents are too long to embed as single vectors — models have token limits, and longer texts dilute meaning. Split documents into overlapping chunks that preserve context.
The overlap ensures that sentences split across chunk boundaries still appear intact in at least one chunk. A 512-word chunk with 64-word overlap is a practical default for most embedding models.
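With these defaults you can estimate the chunk count up front; a small sketch (`estimate_chunks` is a hypothetical helper introduced here, assuming a sliding window that advances by `chunk_size - overlap` words):

```python
import math

def estimate_chunks(n_words: int, chunk_size: int = 512, overlap: int = 64) -> int:
    # Each chunk after the first advances the window by (chunk_size - overlap) words,
    # so a sliding window over n_words produces ceil(n_words / step) chunks.
    step = chunk_size - overlap
    return max(1, math.ceil(n_words / step))

print(estimate_chunks(10_000))  # -> 23 chunks for a 10,000-word document
```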
Use the sentence-transformers library to generate embeddings locally. The all-MiniLM-L6-v2 model produces 384-dimensional vectors and runs efficiently on CPU.
Pre-normalizing embeddings is critical — when vectors are unit-length, cosine similarity reduces to a dot product, which is significantly faster to compute.
Build an in-memory vector store that holds embeddings as a NumPy matrix. Searching becomes a single matrix multiplication against the query vector.
For 100,000 chunks at 384 dimensions, the embedding matrix occupies roughly 150MB — comfortable for a single server. The matrix multiplication search completes in under 10ms.
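The 150MB figure follows directly from the array shape; a quick check (float32 is 4 bytes per value):

```python
n_chunks, dim, bytes_per_float32 = 100_000, 384, 4
total_mb = n_chunks * dim * bytes_per_float32 / 1e6
print(total_mb)  # -> 153.6 MB, i.e. "roughly 150MB"
```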
Expose the vector store through a REST API. A query arrives as text, gets embedded, and the ranked results come back as JSON.
Add an endpoint to ingest new documents. The pipeline chunks the text, generates embeddings, and adds them to the store — all in a single request.
For production, you would run embedding generation in a background task queue. For this tutorial, synchronous processing keeps the architecture simple.
Build a search interface that sends queries as the user types (debounced) and renders results with relevance scores.
When multiple chunks from the same document match, the results list looks redundant. Group results by document and show the highest-scoring chunk as the primary result.
For durability without a database, serialize the vector store to disk using NumPy's memory-mapped arrays. This allows the store to survive restarts and loads faster than re-embedding.
As your corpus grows, brute-force search becomes a bottleneck. Two key optimizations extend the system to millions of documents.
Scalar quantization reduces memory by 4x with minimal accuracy loss. Convert float32 embeddings to uint8 by scaling each dimension to the 0-255 range:
Inverted file index (IVF) partitions vectors into clusters using k-means. At query time, only the nearest clusters are searched:
With IVF and quantization combined, search over 1 million documents completes in under 50ms. For even larger corpora, consider dedicated vector databases like pgvector or Qdrant — but understanding these primitives is essential for making informed architecture decisions.
```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    dot_product = np.dot(a, b)
    norm_a = np.linalg.norm(a)
    norm_b = np.linalg.norm(b)
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return float(dot_product / (norm_a * norm_b))
```

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    doc_id: str
    chunk_index: int
    text: str
    embedding: list[float] | None = None

def chunk_document(doc_id: str, text: str, chunk_size: int = 512, overlap: int = 64) -> list[Chunk]:
    words = text.split()
    chunks: list[Chunk] = []
    start = 0
    index = 0
    while start < len(words):
        end = min(start + chunk_size, len(words))
        chunk_text = " ".join(words[start:end])
        chunks.append(Chunk(doc_id=doc_id, chunk_index=index, text=chunk_text))
        start += chunk_size - overlap
        index += 1
    return chunks
```

```python
from sentence_transformers import SentenceTransformer

class EmbeddingService:
    def __init__(self, model_name: str = "all-MiniLM-L6-v2"):
        self.model = SentenceTransformer(model_name)
        self.dimension = self.model.get_sentence_embedding_dimension()

    def embed_texts(self, texts: list[str]) -> np.ndarray:
        embeddings = self.model.encode(
            texts,
            batch_size=32,
            show_progress_bar=True,
            normalize_embeddings=True,  # Pre-normalize for faster cosine similarity
        )
        return np.array(embeddings, dtype=np.float32)

    def embed_query(self, query: str) -> np.ndarray:
        embedding = self.model.encode(
            [query],
            normalize_embeddings=True,
        )
        return np.array(embedding[0], dtype=np.float32)
```

```python
class VectorStore:
    def __init__(self):
        self.chunks: list[Chunk] = []
        self.embeddings: np.ndarray | None = None

    def add_chunks(self, chunks: list[Chunk], embeddings: np.ndarray) -> None:
        for chunk, emb in zip(chunks, embeddings):
            chunk.embedding = emb.tolist()
        self.chunks.extend(chunks)
        if self.embeddings is None:
            self.embeddings = embeddings
        else:
            self.embeddings = np.vstack([self.embeddings, embeddings])

    def search(self, query_embedding: np.ndarray, top_k: int = 10) -> list[tuple[Chunk, float]]:
        if self.embeddings is None or len(self.chunks) == 0:
            return []
        # With normalized vectors, cosine similarity = dot product
        scores = self.embeddings @ query_embedding
        top_indices = np.argsort(scores)[::-1][:top_k]
        results = []
        for idx in top_indices:
            results.append((self.chunks[idx], float(scores[idx])))
        return results

    def count(self) -> int:
        return len(self.chunks)
```

```python
from fastapi import FastAPI, Query
from pydantic import BaseModel

app = FastAPI()

# Module-level singletons shared by the endpoints below
embedding_service = EmbeddingService()
vector_store = VectorStore()

class SearchResult(BaseModel):
    doc_id: str
    chunk_index: int
    text: str
    score: float

@app.get("/api/search")
async def search(
    q: str = Query(..., min_length=1, max_length=500),
    top_k: int = Query(default=10, ge=1, le=100),
) -> list[SearchResult]:
    query_embedding = embedding_service.embed_query(q)
    results = vector_store.search(query_embedding, top_k=top_k)
    return [
        SearchResult(
            doc_id=chunk.doc_id,
            chunk_index=chunk.chunk_index,
            text=chunk.text,
            score=round(score, 4),
        )
        for chunk, score in results
    ]
```

```python
class IngestRequest(BaseModel):
    doc_id: str
    title: str
    content: str

@app.post("/api/ingest")
async def ingest(request: IngestRequest) -> dict:
    chunks = chunk_document(request.doc_id, request.content)
    texts = [c.text for c in chunks]
    embeddings = embedding_service.embed_texts(texts)
    vector_store.add_chunks(chunks, embeddings)
    return {
        "doc_id": request.doc_id,
        "chunks_created": len(chunks),
        "total_chunks": vector_store.count(),
    }
```

```typescript
// src/ui/search.ts
interface SearchResult {
  doc_id: string;
  chunk_index: number;
  text: string;
  score: number;
}

const searchInput = document.getElementById("search-input") as HTMLInputElement;
const resultsContainer = document.getElementById("results") as HTMLDivElement;

let debounceTimer: ReturnType<typeof setTimeout>;
searchInput.addEventListener("input", () => {
  clearTimeout(debounceTimer);
  debounceTimer = setTimeout(() => performSearch(searchInput.value), 300);
});

async function performSearch(query: string): Promise<void> {
  if (query.length < 2) {
    resultsContainer.innerHTML = "";
    return;
  }
  const response = await fetch(`/api/search?q=${encodeURIComponent(query)}&top_k=20`);
  const results: SearchResult[] = await response.json();
  resultsContainer.innerHTML = results
    .map(
      (r) => `
      <div class="result-card" style="border-left: 3px solid ${scoreToColor(r.score)}">
        <div class="result-header">
          <span class="doc-id">${escapeHtml(r.doc_id)}</span>
          <span class="score">${(r.score * 100).toFixed(1)}% match</span>
        </div>
        <p class="result-text">${highlightQuery(r.text, query)}</p>
      </div>
    `
    )
    .join("");
}

function scoreToColor(score: number): string {
  if (score > 0.7) return "#10b981";
  if (score > 0.4) return "#f59e0b";
  return "#6b7280";
}

// Escape untrusted text before injecting it into innerHTML
function escapeHtml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Minimal highlighter: escape the chunk text, then wrap each query word in <mark>
function highlightQuery(text: string, query: string): string {
  let escaped = escapeHtml(text);
  for (const word of query.split(/\s+/).filter((w) => w.length > 1)) {
    const pattern = new RegExp(word.replace(/[.*+?^${}()|[\]\\]/g, "\\$&"), "gi");
    escaped = escaped.replace(pattern, (match) => `<mark>${match}</mark>`);
  }
  return escaped;
}
```

```python
from itertools import groupby

def deduplicate_results(
    results: list[tuple[Chunk, float]], max_per_doc: int = 2
) -> list[tuple[Chunk, float]]:
    sorted_results = sorted(results, key=lambda r: r[0].doc_id)
    deduped: list[tuple[Chunk, float]] = []
    for _doc_id, group in groupby(sorted_results, key=lambda r: r[0].doc_id):
        doc_results = sorted(group, key=lambda r: r[1], reverse=True)
        deduped.extend(doc_results[:max_per_doc])
    deduped.sort(key=lambda r: r[1], reverse=True)
    return deduped
```

```python
import json
import os

class PersistentVectorStore(VectorStore):
    def __init__(self, storage_dir: str):
        super().__init__()
        self.storage_dir = storage_dir
        os.makedirs(storage_dir, exist_ok=True)

    def save(self) -> None:
        if self.embeddings is not None:
            np.save(os.path.join(self.storage_dir, "embeddings.npy"), self.embeddings)
        chunks_data = [
            {"doc_id": c.doc_id, "chunk_index": c.chunk_index, "text": c.text}
            for c in self.chunks
        ]
        with open(os.path.join(self.storage_dir, "chunks.json"), "w") as f:
            json.dump(chunks_data, f)

    def load(self) -> None:
        emb_path = os.path.join(self.storage_dir, "embeddings.npy")
        chunks_path = os.path.join(self.storage_dir, "chunks.json")
        if os.path.exists(emb_path) and os.path.exists(chunks_path):
            # Memory-map the embeddings so startup doesn't copy the whole matrix
            self.embeddings = np.load(emb_path, mmap_mode="r")
            with open(chunks_path) as f:
                data = json.load(f)
            self.chunks = [Chunk(**d) for d in data]
```

```python
def quantize(embeddings: np.ndarray) -> tuple[np.ndarray, np.ndarray, np.ndarray]:
    mins = embeddings.min(axis=0)
    maxs = embeddings.max(axis=0)
    ranges = maxs - mins
    ranges[ranges == 0] = 1.0  # Avoid division by zero for constant dimensions
    scaled = ((embeddings - mins) / ranges * 255).astype(np.uint8)
    return scaled, mins, ranges

def dequantize(quantized: np.ndarray, mins: np.ndarray, ranges: np.ndarray) -> np.ndarray:
    return (quantized.astype(np.float32) / 255.0) * ranges + mins
```

```python
from sklearn.cluster import MiniBatchKMeans

class IVFIndex:
    def __init__(self, n_clusters: int = 64, n_probe: int = 8):
        self.n_clusters = n_clusters
        self.n_probe = n_probe
        self.kmeans: MiniBatchKMeans | None = None
        self.inverted_lists: dict[int, list[int]] = {}

    def build(self, embeddings: np.ndarray) -> None:
        self.kmeans = MiniBatchKMeans(n_clusters=self.n_clusters, batch_size=256)
        labels = self.kmeans.fit_predict(embeddings)
        self.inverted_lists = {}
        for idx, label in enumerate(labels):
            self.inverted_lists.setdefault(int(label), []).append(idx)

    def search(self, query: np.ndarray, embeddings: np.ndarray, top_k: int) -> list[int]:
        if self.kmeans is None:
            raise RuntimeError("Call build() before search()")
        # Distance from the query to each cluster centroid
        distances = self.kmeans.transform(query.reshape(1, -1))[0]
        nearest_clusters = np.argsort(distances)[: self.n_probe]
        candidate_indices = []
        for c in nearest_clusters:
            candidate_indices.extend(self.inverted_lists.get(int(c), []))
        candidates = embeddings[candidate_indices]
        scores = candidates @ query
        top_local = np.argsort(scores)[::-1][:top_k]
        return [candidate_indices[i] for i in top_local]
```
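As a sanity check that ranking survives quantization, here is a self-contained sketch on random data (the per-dimension min/range scaling mirrors `quantize`/`dequantize` above; the seed, sizes, and choice of row 42 as the query are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 384)).astype(np.float32)
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)

# Scalar quantization: scale each dimension into 0-255 and store as uint8
mins = embeddings.min(axis=0)
ranges = embeddings.max(axis=0) - mins  # no zero ranges with continuous random data
quantized = ((embeddings - mins) / ranges * 255).astype(np.uint8)

# Query with a stored vector: after dequantizing, it should still rank itself first
query = embeddings[42]
restored = (quantized.astype(np.float32) / 255.0) * ranges + mins
scores = restored @ query
assert int(np.argmax(scores)) == 42
```

The quantization error per dimension is at most the step size (`range / 255`), which is far smaller than the gap between a vector's similarity to itself and to unrelated documents, so top results are stable in practice.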