Chapter 8: Embeddings, Vector Search, and LLM-Aided Features

Chapter Introduction

Classical regression — the workhorse of Chapter 2 — takes a numeric matrix and returns predictions. Most of the data inside a modern organisation is not a numeric matrix. Earnings-call transcripts, news headlines, SEC filings, clinical notes, customer-support tickets, product images, satellite imagery, audio recordings — all of it has to be converted to numbers before any statistical method can touch it. This chapter teaches the feature-engineering layer that does that conversion, and how the converted features plug into the methods of the previous chapters.

Four families of methods sit in this layer, in order of how often they appear in real pipelines:

Embeddings — fixed-length numeric vectors that encode unstructured objects (text, images, audio) such that semantically similar objects are close together. Used as features in any of the regression / clustering / pattern-recognition machinery from earlier chapters.
Vector search and retrieval-augmented generation (RAG) — once you have embeddings, you index them so that “find the most similar object to this query” is a millisecond operation. RAG layers an LLM on top of retrieval to answer questions grounded in a knowledge base.
Graph neural networks (GNNs) — when the data has natural relational structure (a social graph, a citation network, a customer–product bipartite graph), GNNs learn representations that respect that structure.
LLM-aided structured extraction — let a large language model turn free-form text into structured columns (a guidance_change flag from a transcript, a severity field from a customer ticket). The structured columns are fed into the classical regression / classification machinery downstream.

A note on what is honest to show in a browser. Real transformer embeddings — sentence-transformers, OpenAI embeddings, Voyage — are gigabytes of weights and don’t run in Pyodide. What does run in the browser is the workflow: build a toy embedding with TF-IDF + Truncated SVD (the 1980s technique that is the conceptual ancestor of modern transformer embeddings), implement vector search by hand, run a small GNN message-passing demonstration in pure NumPy, and use a regex stand-in for the LLM extraction. In production each toy block is swapped for a transformer or LLM call, but the surrounding statistical pipeline is identical.

This chapter is deliberately focused on the methods — how to compute, fit, and use each tool. The architectural questions of how the resulting features attach to a domain ontology, how the pipeline is governed, and how lineage is tracked belong to the practitioner’s discipline of domain modelling and are covered in the companion volume Domain Modelling in Python.

Graph Neural Networks — Learning on the Graph
Embeddings — From Unstructured Data to Statistical Features
Vector Databases and Retrieval-Augmented Generation (RAG)
Cross-Modal Embeddings and Domain Fine-Tuning
LLM-Aided Hypothesis and Feature Discovery

Graph Neural Networks — Learning on the Graph

A traditional regression treats each row of data as independent. A Graph Neural Network (GNN) treats them as connected — each row knows about the rows it links to, and the model’s prediction for one node depends on the features of its neighbours, the features of their neighbours, and so on. Whenever the data has natural relational structure — a social graph, a citation network, a customer–product bipartite graph, a road network, a molecular structure — GNNs let downstream prediction exploit that structure as features in a way classical methods cannot.

The core operation is message passing. For each node \(v\) with feature vector \(h_v\) and neighbour set \(\mathcal{N}(v)\), one layer of a GNN computes

\[ h_v^{(\ell+1)} \;=\; \sigma\!\left(W^{(\ell)}\,h_v^{(\ell)} \;+\; U^{(\ell)} \cdot \text{aggregate}_{u \in \mathcal{N}(v)}\!\big(h_u^{(\ell)}\big)\right). \]

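The aggregate function is typically a mean, sum, or attention-weighted sum. After \(L\) layers, each node’s representation reflects information from nodes up to \(L\) hops away. Stacking 2–3 layers is usually enough; deeper stacks suffer from over-smoothing, where repeated averaging pulls the representations of different nodes towards one another until they are hard to tell apart.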
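The update rule above can be sketched directly in NumPy. This is a minimal illustration, assuming mean aggregation and ReLU for \(\sigma\); the three-node path graph, the feature values, and the identity weight matrices are made up so the arithmetic is easy to check by hand.

```python
import numpy as np

def message_passing_layer(H, A, W, U):
    """One layer: h_v' = relu(W h_v + U * mean of neighbour features)."""
    deg = A.sum(axis=1, keepdims=True)          # number of neighbours per node
    neigh_mean = (A @ H) / np.maximum(deg, 1)   # mean aggregate; isolated nodes get zeros
    return np.maximum(0.0, H @ W.T + neigh_mean @ U.T)  # ReLU plays the role of sigma

# Hypothetical toy graph: a path 0 - 1 - 2 with 2-dimensional node features.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
H0 = np.array([[1.0, 0.0],
               [0.0, 1.0],
               [2.0, 0.0]])
W = np.eye(2)   # identity weights so the result can be verified by hand
U = np.eye(2)

H1 = message_passing_layer(H0, A, W, U)
print(H1)
```

With identity weights each node's new vector is its own features plus the mean of its neighbours' features. Node 0 only sees node 1 after one layer; applying the layer a second time would let information from node 2 reach it, which is the "one hop per layer" behaviour described above.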

The three main GNN variants:

GCN (Kipf & Welling 2017) — normalised mean aggregation; the textbook starting point.
GraphSAGE (Hamilton et al. 2017) — neighbour sampling; scales to graphs with millions of nodes.
GAT (Veličković et al. 2018) — attention-weighted aggregation; learns which neighbours matter for each node.

Real-world deployments: Pinterest uses GraphSAGE on a 3-billion-node graph to power related-pin recommendations; Uber Eats uses a GraphSAGE-based model to recommend dishes and restaurants; Two Sigma has published research on GNN-based factor models; Amazon uses GNNs in fraud detection on the buyer-seller graph; Google Maps uses spatio-temporal GNNs to predict traffic.

Module reference — GNN libraries

Outside Pyodide:
PyTorch Geometric (torch_geometric): the dominant GNN library; convenience layers for GCN, GraphSAGE, GAT, and dozens of variants.
DGL (dgl): alternative library with multi-backend support.
Spektral: Keras-based GNNs.
NetworkX does not implement GNNs but is fine for the graph structure that feeds them.

Inside Pyodide we hand-implement one GCN-style layer in pure NumPy below: about 20 lines, and you understand what the production libraries are wrapping.
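A minimal sketch of such a GCN-style layer, assuming the symmetric normalisation \(D^{-1/2}(A + I)D^{-1/2}\) from Kipf & Welling and ReLU activations. The toy graph (two triangles joined by one edge), the random features, and the random untrained weights are all illustrative assumptions, not the chapter's original cell.

```python
import numpy as np

def normalise_adjacency(A):
    """Symmetric normalisation D^-1/2 (A + I) D^-1/2, with self-loops added."""
    A_tilde = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    return A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_layer(H, A_hat, W):
    """GCN-style layer: relu(A_hat @ H @ W)."""
    return np.maximum(0.0, A_hat @ H @ W)

rng = np.random.default_rng(0)

# Hypothetical toy graph: two triangles (nodes 0-2 and 3-5) joined by edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
A = np.zeros((6, 6))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

X = rng.normal(size=(6, 4))      # random node features
W1 = rng.normal(size=(4, 8))     # untrained (random) weights, layer 1
W2 = rng.normal(size=(8, 2))     # untrained (random) weights, layer 2

A_hat = normalise_adjacency(A)
H = gcn_layer(gcn_layer(X, A_hat, W1), A_hat, W2)   # two stacked layers
print(H.shape)
```

Multiplying by `A_hat` is the whole message-passing step: each node's row becomes a degree-weighted average of its own and its neighbours' rows. A trained model would learn `W1` and `W2` by gradient descent on a task loss, which this sketch does not attempt.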

Even with random weights, two GCN layers visibly improve cluster recovery — message passing alone (no training) injects neighbourhood structure into the representation. With trained weights and a proper loss, GNNs routinely beat tabular baselines on any task where relational structure matters.

When not to reach for GNNs:

Your graph is sparse and weakly informative — message passing washes out the signal.
Tabular features carry all the information — a tree ensemble (Chapter 2) wins for simplicity.
The graph changes faster than you can re-train — inductive GraphSAGE-style models help, but operational cost is real.

A typical production use is a GraphSAGE model on the bipartite user-product graph. Each user or product is a node; an edge encodes a past purchase; node features are demographics and product metadata; the GNN learns embeddings that capture multi-hop user-product-user-product similarity invisible to handcrafted features. Pinterest’s “PinSage” (Ying et al. 2018) is the published reference: a GraphSAGE-style model on a 3-billion-node graph that significantly outperformed the prior handcrafted-feature production model.

Embeddings — From Unstructured Data to Statistical Features

An embedding is a function from an unstructured object (a sentence, a paragraph, a 10-K filing, an image, a sound clip) to a fixed-length numeric vector. Two key properties:

The vector has constant dimension (typically 384, 768, or 1024) regardless of input length.
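Vectors of semantically similar objects are close together, usually measured by cosine similarity (for unit-normalised vectors this gives the same neighbour ranking as Euclidean distance).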

The killer application: once your unstructured object is an embedding, every method in this book applies to it. Regression on text. Clustering of clinical notes. Anomaly detection on news headlines. PCA-factor models on Reddit posts. The classical statistical pipeline doesn’t change — only the front-end feature extractor changes.

Real-world production embeddings:

Sentence-transformers (all-MiniLM-L6-v2, all-mpnet-base-v2) — the open-source workhorse; ~80MB to ~440MB; runs on CPU.
OpenAI text-embedding-3-small / -large — paid API; the default at many firms for fast prototyping.
Cohere embed-english-v3 — competitor, strong on retrieval.
Voyage-finance-2 — finance-domain-tuned; outperforms generic embeddings on 10-K classification.
BiomedCLIP — medical-imaging-tuned, used in radiology pipelines.
CLIP (OpenAI) — joint image-text embeddings; the foundation of modern visual search.

Below is a toy embedding built with TF-IDF + Truncated SVD (a 1980s technique that is the conceptual ancestor of modern transformer embeddings). The pipeline is identical to a real one — only the embedding block is swapped out.

Module reference — sklearn text → vector

sklearn.feature_extraction.text.TfidfVectorizer(max_features=k, stop_words='english') — term-frequency / inverse-document-frequency. Sparse vectors of size vocab.
sklearn.decomposition.TruncatedSVD(n_components=d) — reduce to d dense components (latent-semantic analysis).
For real transformer embeddings outside Pyodide: pip install sentence-transformers; then SentenceTransformer('all-MiniLM-L6-v2').encode(texts) returns a (n, 384) numpy array.
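The two sklearn calls above can be chained into a toy embedding as follows. This is a sketch under stated assumptions: the six-sentence corpus is invented for illustration, and `n_components=3` is chosen only because the corpus has three obvious topics.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up mini corpus, for illustration only.
texts = [
    "The company raised full-year revenue guidance.",
    "Management lowered earnings guidance for the quarter.",
    "The patient reports mild chest pain and fatigue.",
    "Patient presents with severe headache and nausea.",
    "Customer cannot log in after the password reset.",
    "The app crashes when the customer opens settings.",
]

tfidf = TfidfVectorizer(stop_words="english")
X_sparse = tfidf.fit_transform(texts)            # (n_docs, vocab_size), sparse

svd = TruncatedSVD(n_components=3, random_state=0)
E = svd.fit_transform(X_sparse)                  # (n_docs, 3), dense

# Unit-normalise rows so that dot products are cosine similarities.
E = E / np.linalg.norm(E, axis=1, keepdims=True)

print(E.shape)
print(np.round(E @ E.T, 2))                      # pairwise cosine similarities
```

The matrix `E` plays the role of the embedding matrix in the rest of the pipeline. Swapping in a transformer would replace the two fitted objects with a single `encode` call and leave the downstream code untouched.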

With a real transformer embedding (768 dimensions, trained on billions of sentences) the same clustering step would produce significantly cleaner clusters, and the same downstream code would be the production pipeline. This is the practical payoff: master the classical pipeline once, then swap the front-end embedding block as better models appear.

Embeddings as regression features

Once you have an embedding matrix \(E \in \mathbb{R}^{n \times d}\), you can use the columns of \(E\) as inputs to any of the methods in Chapter 2. The example below predicts a continuous target from sentence embeddings using LASSO with cross-validation — exactly the workflow used in production alternative-data pipelines.
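A minimal sketch of that workflow, assuming scikit-learn's `LassoCV`. The embedding matrix here is random noise standing in for real sentence embeddings, and the target is synthetic, built so that only three dimensions carry signal; none of the numbers come from a real dataset.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Stand-in for a real embedding matrix E (n documents x d dimensions).
n, d = 300, 32
E = rng.normal(size=(n, d))

# Synthetic target: only three embedding dimensions carry signal.
true_beta = np.zeros(d)
true_beta[[0, 5, 9]] = [2.0, -1.5, 1.0]
y = E @ true_beta + 0.1 * rng.normal(size=n)

E_train, E_test, y_train, y_test = train_test_split(
    E, y, test_size=0.25, random_state=0
)

# 5-fold cross-validation picks the L1 penalty strength alpha.
model = LassoCV(cv=5, random_state=0).fit(E_train, y_train)

r2 = model.score(E_test, y_test)          # out-of-sample R^2
selected = np.flatnonzero(model.coef_)    # dimensions kept by the L1 penalty
print(f"alpha={model.alpha_:.4f}  test R^2={r2:.3f}  kept={len(selected)} of {d}")
```

The L1 penalty matters here because embedding dimensions are numerous and individually uninterpretable; letting cross-validation zero out most of them is a simple guard against overfitting when \(d\) is large relative to \(n\).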

Vector Databases and Retrieval-Augmented Generation (RAG)

Once you can embed text, images, or any unstructured object into a fixed-dimensional vector, the natural next question is how do I find the nearest vectors at scale? A vector database stores millions to billions of embeddings together with the original document IDs and metadata, and indexes them so that approximate nearest-neighbour (ANN) queries return the top-\(k\) most-similar vectors in milliseconds. This infrastructure layer is what makes embeddings genuinely useful in production.
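Before any index is involved, nearest-neighbour search is just a linear scan. The sketch below shows the exact version that ANN indexes approximate, assuming cosine similarity; the random "stored embeddings" and the perturbed query are synthetic stand-ins.

```python
import numpy as np

def top_k_cosine(query, index, k=3):
    """Exact (linear-scan) top-k search by cosine similarity."""
    index_n = index / np.linalg.norm(index, axis=1, keepdims=True)
    query_n = query / np.linalg.norm(query)
    sims = index_n @ query_n                    # one dot product per stored vector
    top = np.argpartition(-sims, k - 1)[:k]     # the k best, in arbitrary order
    top = top[np.argsort(-sims[top])]           # sort only those k by similarity
    return top, sims[top]

rng = np.random.default_rng(0)
index = rng.normal(size=(10_000, 64))               # stand-in for stored embeddings
query = index[42] + 0.01 * rng.normal(size=64)      # slightly perturbed copy of row 42

ids, scores = top_k_cosine(query, index, k=3)
print(ids, scores)
```

The scan costs one dot product per stored vector, which is fine for thousands of vectors and too slow for billions. That gap is what the approximate indexes in a vector database are designed to close.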

The leading commercial and open-source options as of 2026:

Pinecone — managed, the production default at many AI-first companies.
Weaviate — open-source, hybrid (vector + keyword) search.
Qdrant — Rust-based, popular in self-hosted setups.
Milvus — open-source, the choice for very large indexes.
Chroma — lightweight, dev-friendly; common for prototyping.
FAISS (Facebook AI Similarity Search): the low-level ANN library that some vector databases (Milvus, for example) build on top of.
pgvector — PostgreSQL extension for embeddings; the right choice when you already have a PostgreSQL stack.

The dominant ANN algorithm is HNSW (Hierarchical Navigable Small World, Malkov & Yashunin 2018) — a graph-based index that achieves sublinear query time with very high recall. For huge indexes, IVF-PQ (inverted-file product-quantisation) trades a small accuracy loss for 10-100× memory savings.

Retrieval-Augmented Generation (RAG)

The single most important application of vector databases in 2024–2026 is Retrieval-Augmented Generation. The recipe:

Chunk a knowledge base (documents, code, manuals, financial filings, EHRs) into passages.
Embed each passage and store in the vector DB with metadata.
At query time, embed the user’s question; retrieve the top-\(k\) most similar passages.
Pass the retrieved passages + the question to an LLM as context.
The LLM answers using the retrieved evidence, citing sources.

RAG is the standard architecture for customer-support chatbots grounded in a product manual, legal-research assistants grounded in case law, internal “ask the data lake” tools at large firms, and medical-decision-support systems grounded in clinical guidelines. The advantage over fine-tuning the LLM directly is updatability: change the knowledge base and the answers change on the next query, with no retraining required.

Module reference — RAG stack

Outside Pyodide, a minimal RAG pipeline:
Embeddings: sentence_transformers.SentenceTransformer(...) or OpenAI text-embedding-3-small.
Vector DB: chromadb.Client() for local, pinecone.Index(...) for managed.
Retrieval: collection.query(query_embeddings=q, n_results=5).
Generation: pass retrieved passages to openai.chat.completions.create() or any LLM.
Framework: langchain, llama_index, or haystack orchestrate the chain.

The cell below builds a tiny RAG-style index in pure NumPy: it takes a corpus of 12 “policy snippets,” computes TF-IDF embeddings, indexes them, and answers a free-form question by retrieving the top-3 most similar snippets. In production you swap the TF-IDF for a transformer embedding and the linear scan for FAISS; the surrounding logic is identical.
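A compact sketch of that retrieve-then-prompt flow, assuming a hand-rolled TF-IDF with smoothed IDF and cosine similarity. The four snippets, the question, and the prompt wording are invented for illustration (the chapter's own corpus has 12 snippets), and no LLM is called.

```python
import re
import numpy as np

# Made-up policy snippets, for illustration only.
snippets = [
    "Refunds are issued within 14 days of a returned purchase.",
    "Employees accrue 25 days of annual leave per calendar year.",
    "Passwords must be rotated every 90 days and contain 12 characters.",
    "Travel expenses above 500 euros require manager approval.",
]

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

vocab = sorted({tok for s in snippets for tok in tokenize(s)})
col = {tok: j for j, tok in enumerate(vocab)}

def count_vector(text):
    v = np.zeros(len(vocab))
    for tok in tokenize(text):
        if tok in col:                  # ignore words never seen in the corpus
            v[col[tok]] += 1.0
    return v

counts = np.vstack([count_vector(s) for s in snippets])
df = (counts > 0).sum(axis=0)                           # document frequency
idf = np.log((1 + len(snippets)) / (1 + df)) + 1.0      # smoothed IDF

def embed(text):
    v = count_vector(text) * idf
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

index = np.vstack([embed(s) for s in snippets])         # the "vector store"

def retrieve(question, k=2):
    sims = index @ embed(question)                      # cosine similarity
    order = np.argsort(-sims)[:k]
    return [(int(i), float(sims[i])) for i in order]

question = "How often must passwords be rotated?"
hits = retrieve(question)

# Assemble the prompt that a real system would send to an LLM.
context = "\n".join(f"[{i}] {snippets[i]}" for i, _ in hits)
prompt = (
    "Answer using only the snippets below; cite by number.\n"
    f"{context}\n\nQuestion: {question}"
)
print(prompt)
```

Only the password snippet shares any vocabulary with the question, so it ranks first; the second slot is filled by a snippet with zero similarity. In practice a similarity threshold is a reasonable addition so that irrelevant passages are not passed to the LLM as evidence.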

The retrieval is precise on every query — the right policy snippet rises to the top. In a full RAG system the retrieved snippets are then passed to an LLM with a prompt template like “Answer the user’s question using only the policy snippets below; cite by line number.” The LLM’s answer is grounded in retrieved evidence and is easy to audit — a property pure-LLM answers do not have.

Hybrid retrieval

Pure vector retrieval misses queries that depend on rare keywords (product SKUs, error codes, drug names). Hybrid retrieval combines vector similarity with classical BM25 or keyword scoring. The two rankings are often merged with reciprocal rank fusion (RRF), which scores each document by summing \(1/(k + \text{rank})\) over the rankers that returned it. Production RAG systems are almost always hybrid.
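Reciprocal rank fusion is short enough to write out in full. The sketch below assumes the commonly used constant \(k = 60\); the document IDs and the two input rankings are hypothetical.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs (best first) into one list."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

# Hypothetical rankings from a vector retriever and a BM25 retriever.
vector_ranking = ["doc_a", "doc_b", "doc_c"]
bm25_ranking = ["doc_c", "doc_a", "doc_d"]

fused = reciprocal_rank_fusion([vector_ranking, bm25_ranking])
print(fused)  # ['doc_a', 'doc_c', 'doc_b', 'doc_d']
```

Because RRF uses only ranks, it sidesteps the problem that cosine similarities and BM25 scores live on different scales. Documents that both retrievers rank highly (`doc_a`, `doc_c`) rise above documents that only one of them found.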

Cross-Modal Embeddings and Domain Fine-Tuning

Domain fine-tuning

Off-the-shelf embeddings (OpenAI, MiniLM, mpnet) are trained on general web text. They underperform on specialised domains — medical notes, legal contracts, financial filings — because the vocabulary distribution is different and similarity needs to be judged differently. Three production approaches:

Use a pre-fine-tuned domain model — FinBERT, BioBERT, LegalBERT, Voyage-finance-2. Free or paid; immediately drops in.
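Fine-tune yourself with contrastive learning: gather pairs of “should-be-similar” examples (optionally with hard “should-be-different” negatives) and use sentence-transformers’ MultipleNegativesRankingLoss, which treats the other examples in each batch as negatives. Roughly two days of work; typically gives the best results.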
Train a thin adapter — keep the base model frozen and learn a small projection on top. Cheaper than full fine-tuning; useful for narrow domains.
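To make the contrastive objective in the second approach concrete, here is a NumPy re-implementation of the idea behind a multiple-negatives ranking loss. It is a sketch for intuition, not the sentence-transformers code: it assumes cosine similarity with a scale factor of 20, and the anchor and positive vectors are random stand-ins for encoder outputs.

```python
import numpy as np

def multiple_negatives_ranking_loss(anchors, positives, scale=20.0):
    """Cross-entropy over in-batch candidates: row i's correct match is column i."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = scale * (a @ p.T)                              # scaled cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)     # numerical stability
    log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_softmax))

rng = np.random.default_rng(0)
anchors = rng.normal(size=(8, 16))
good_positives = anchors + 0.05 * rng.normal(size=(8, 16))  # each close to its anchor
bad_positives = rng.normal(size=(8, 16))                    # unrelated to the anchors

loss_good = multiple_negatives_ranking_loss(anchors, good_positives)
loss_bad = multiple_negatives_ranking_loss(anchors, bad_positives)
print(loss_good, loss_bad)
```

The loss is low when each anchor is more similar to its own positive than to every other positive in the batch. Fine-tuning adjusts the encoder weights to push the loss down on in-domain pairs, which is why larger batches (more in-batch negatives) generally give a stronger training signal.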

Empirical rule of thumb: a domain-tuned embedding outperforms a generic one by 10-20 retrieval-recall points on the target domain, often with as few as 5,000–50,000 in-domain training pairs. The investment is small; the lift is real.

LLM-Aided Hypothesis and Feature Discovery

Large language models have become a research-pipeline tool at many analytics teams, not as the model but as the front end. The three places they show up:

Hypothesis discovery. Given a description of a dataset and a research goal, ask the LLM to propose candidate predictors. The output is a list of features to engineer, often with rationale and a sketched Python implementation. The analyst then engineers, tests, and discards.
Feature extraction from text. Given an earnings-call transcript, a clinical note, or a customer review, prompt the LLM to extract structured fields (“guidance change,” “symptom severity,” “purchase intent”) into a JSON. The JSON columns become features.
Code review and bug-catching. LLMs are remarkably good at spotting look-ahead bias, off-by-one errors, and selection bugs in research code. Many large firms now use internal LLMs as a mandatory first review for any model submitted into production.

The discipline of every classical chapter still applies, especially the multiple-testing correction from Chapter 1. An LLM can hand you 200 plausible candidate features; all 200 must pass the Benjamini–Hochberg (BH) gauntlet before any of them is allowed to drive a decision.
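As a reminder of what that gauntlet looks like in code, here is a minimal sketch of the Benjamini–Hochberg step-up procedure at false-discovery rate \(q\). The six p-values are hypothetical, standing in for tests run on LLM-proposed features.

```python
import numpy as np

def benjamini_hochberg(p_values, q=0.05):
    """Return a boolean mask of hypotheses rejected at false-discovery rate q."""
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    order = np.argsort(p)
    thresholds = q * np.arange(1, m + 1) / m        # (i / m) * q for the i-th smallest p
    passed = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if passed.any():
        cutoff = np.max(np.flatnonzero(passed))     # largest rank that clears its threshold
        reject[order[: cutoff + 1]] = True          # reject everything up to that rank
    return reject

# Hypothetical p-values for six LLM-proposed candidate features.
p_values = [0.001, 0.008, 0.039, 0.041, 0.20, 0.74]
keep = benjamini_hochberg(p_values, q=0.05)
print(keep)  # [ True  True False False False False]
```

The third and fourth features have p-values below 0.05 and would pass a naive per-test threshold, but they fail the BH thresholds of 0.025 and 0.033 for their ranks. With 200 candidates instead of six, the thresholds for the smallest p-values become correspondingly stricter.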

Production architecture (text)

Generic pipeline at a representative analytics team:
1. Ingest text → headline / transcript / report text + metadata.
2. Entity-link each text to the firm ontology (NER + ontology lookup).
3. Embed the text with a domain-tuned transformer.
4. Score or classify via a fine-tuned model on top of the embedding.
5. Aggregate to a per-entity, per-day feature vector.
6. Decide through a classical Chapter-2 regression that uses the new features alongside ~50 traditional factors.

The example below replaces the LLM call with a deterministic synthetic-LLM function — a regex that pretends to extract the same fields — so the workflow is runnable in the browser. The downstream regression is identical to a production one.
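A regex stand-in of this kind might look like the sketch below. The patterns, the function name `fake_llm_extract`, the output field names, and the three transcript sentences are all hypothetical; a real LLM would handle far more varied phrasing.

```python
import re

# Hypothetical patterns: a verb signalling direction, later followed by "guidance".
RAISE = re.compile(r"\b(rais\w*|increas\w*|lift\w*)\b.*\bguidance\b", re.IGNORECASE)
CUT = re.compile(r"\b(lower\w*|cut\w*|reduc\w*)\b.*\bguidance\b", re.IGNORECASE)

def fake_llm_extract(text):
    """Deterministic stand-in for an LLM call: text in, structured fields out."""
    if RAISE.search(text):
        change = 1
    elif CUT.search(text):
        change = -1
    else:
        change = 0
    return {
        "guidance_change": change,
        "mentions_guidance": "guidance" in text.lower(),
    }

# Made-up transcript sentences, for illustration only.
transcripts = [
    "We are raising our full-year guidance on strong demand.",
    "Given softer orders, we cut guidance for the second half.",
    "Margins were stable and we have no update to share.",
]

rows = [fake_llm_extract(t) for t in transcripts]
print(rows)
```

Each dictionary becomes one row of structured columns, ready to be joined to the numeric feature matrix. Replacing `fake_llm_extract` with a real LLM call that returns the same keys would leave the downstream regression code unchanged.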

The point is structural: the LLM does one specialised job — turning unstructured natural language into structured columns. Everything downstream is the regression and validation discipline of the previous chapters. Strip out the LLM and the workflow doesn’t function; strip out the regression and the LLM output cannot be used to decide anything.

Chapter Wrap-up

This chapter covered the modern AI front-end as a set of feature-engineering methods for statistical pipelines: turning text and images into vectors (embeddings), retrieving similar vectors at scale (vector databases, RAG), exploiting relational structure with graph neural networks, and letting LLMs extract structured fields from unstructured documents. None of these replace the classical statistical machinery of Chapters 1–7. They feed it new kinds of input.

Two operational rules that apply regardless of the platform:

Embedding drift. The same sentence embedded with a 2022 model and again after a version upgrade in 2026 yields different vectors, and vectors from the two versions are not comparable. Lock the embedding model version, just as you lock numpy / sklearn versions.
LLM hallucination in feature extraction. An LLM asked to extract structured fields will occasionally invent values that aren’t in the text. Require structured-output (JSON-schema) prompts and have a human spot-check at least 1% of extractions.
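The second rule can be partly automated. The sketch below uses only the standard library and assumes a hypothetical two-field schema (`guidance_change`, `evidence`); it checks field presence, types, and allowed values, applies one cheap hallucination test (the quoted evidence must appear verbatim in the source), and draws a 1% random sample for human review.

```python
import json
import random

# Hypothetical schema: field name -> (expected type, allowed values or None).
SCHEMA = {
    "guidance_change": (int, {-1, 0, 1}),
    "evidence": (str, None),
}

def validate_extraction(raw_json, source_text):
    """Return (record, errors); records with errors should not reach the feature table."""
    try:
        record = json.loads(raw_json)
    except json.JSONDecodeError as exc:
        return None, [f"invalid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return None, ["top-level JSON value is not an object"]
    errors = []
    for field, (ftype, allowed) in SCHEMA.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], ftype):
            errors.append(f"wrong type for {field}")
        elif allowed is not None and record[field] not in allowed:
            errors.append(f"value not allowed for {field}")
    # Cheap hallucination check: the quoted evidence must appear in the source.
    evidence = record.get("evidence")
    if isinstance(evidence, str) and evidence not in source_text:
        errors.append("evidence not found in source text")
    return record, errors

def spot_check_sample(record_ids, fraction=0.01, seed=0):
    """Pick a reproducible random sample of records for human review."""
    ids = list(record_ids)
    k = max(1, round(len(ids) * fraction))
    return random.Random(seed).sample(ids, k)

source = "We are raising our full-year guidance on strong demand."
good = '{"guidance_change": 1, "evidence": "raising our full-year guidance"}'
bad = '{"guidance_change": 1, "evidence": "record profits this quarter"}'

_, good_errors = validate_extraction(good, source)
_, bad_errors = validate_extraction(bad, source)
to_review = spot_check_sample(range(1000))
print(good_errors, bad_errors, len(to_review))
```

The verbatim-evidence check catches only one kind of invention, so it complements the human spot-check rather than replacing it.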

The architectural questions — how the resulting features attach to a domain ontology, how lineage is tracked, how the whole pipeline is governed — belong to the practitioner’s discipline of domain modelling and are covered in the companion volume Domain Modelling in Python. This book treats those topics as out of scope and concentrates on the statistical methods.

Chapter 9 covers the rest of the AI-driven toolkit: foundation models for time series, causal AI at ML scale, and symbolic regression — methods that operate on the embedded, structured feature matrix produced by this chapter.
