Vector Storage in LLMs: A Deep Technical Analysis
Comparison of Top 5 Vector Storage Solutions
Large Language Model (LLM) applications often rely on vector databases or indexes to store and retrieve high-dimensional embeddings. Below we compare five leading vector storage solutions used with LLMs – Pinecone, Weaviate, ChromaDB, FAISS, and Vespa – focusing on their features, scalability, latency, indexing methods, integrations, and pricing:
Pinecone
- Type: Fully-managed cloud vector database (closed-source; no local or on-prem deployment) (Vector Databases Compared: Pinecone, Milvus, Chroma, Weaviate, FAISS, and more). Designed as a service for ease of use and performance.
- Scalability: Built to scale horizontally and vertically with a distributed architecture (Vector Databases Compared: Pinecone, Milvus, Chroma, Weaviate, FAISS, and more). Handles billions of vectors with consistently low query latency (Pinecone Vector Database ~ AnythingLLM). (Pinecone “serves fresh, filtered query results with low latency at the scale of billions of vectors.” (Pinecone Vector Database ~ AnythingLLM))
- Latency: Optimized for real-time similarity search. Even with hundreds of millions of vectors, query latencies remain low (on the order of milliseconds) due to Pinecone’s proprietary indexing and caching techniques (February Release: Performance at Scale, Predictability, and Control) (Pinecone Vector Database ~ AnythingLLM).
- Indexing: Uses a proprietary graph-based ANN index (often described as the “Pinecone graph index”) supporting cosine, dot-product, and Euclidean distances (Vector Databases Compared: Pinecone, Milvus, Chroma, Weaviate, FAISS, and more). The details are abstracted from users, but it automatically optimizes indexing for performance.
- Integrations: Offers easy-to-use SDKs (Python, Node.js, Go, Java, etc.) and a high-performance gRPC and REST API (Vector Databases Compared: Pinecone, Milvus, Chroma, Weaviate, FAISS, and more). Integrates well with LLM frameworks like LangChain for retrieval-augmented generation. Supports metadata filtering using a MongoDB-like query language on vector metadata (up to 40KB of metadata per vector) (Pinecone Vector Database ~ AnythingLLM).
- Features: Provides namespace isolation and collections for organizing vectors. Ensures data is encrypted in transit and at rest (Vector Databases Compared: Pinecone, Milvus, Chroma, Weaviate, FAISS, and more). Fully managed (no DevOps overhead) and recently introduced a serverless usage model for on-demand scaling (Vector Databases 101: What are Vector Databases? - Vation Ventures).
- Pricing: Usage-based pricing. There’s a free tier for development, and paid plans are billed per pod (compute unit) hour and vector storage. For example, the standard tier starts around $0.096/hour per pod (Pricing - Pinecone) (which translates to about $70/month per pod), with larger pods or higher-performance tiers costing more. Backup storage and other enterprise features are additional (Pricing | Pinecone).
Weaviate
- Type: Open-source vector database (self-host or managed cloud). Available as a standalone OSS deployment (Docker, Kubernetes) and as Weaviate Cloud Service (WCS) for a fully-managed solution (Weaviate vs Pinecone | Zilliz).
- Scalability: Supports both horizontal scaling (sharding across nodes) and vertical scaling. In self-hosted setups, you can shard data to handle billions of vectors across a cluster. Weaviate is designed for high throughput and can be scaled to production workloads of considerable size (it’s proven to handle enterprise-scale indexes) (Vector Databases Compared: Pinecone, Milvus, Chroma, Weaviate, FAISS, and more).
- Latency: Offers fast, low-latency vector searches by combining vector indexes with structured filtering (Weaviate vs Pinecone | Zilliz). Typical query times are milliseconds for millions of vectors when using indexes like HNSW on adequate hardware.
- Indexing: Uses the HNSW (Hierarchical Navigable Small World) graph for ANN by default (Vector Databases Compared: Pinecone, Milvus, Chroma, Weaviate, FAISS, and more). Weaviate indexes every object’s embedding for similarity search and can also perform hybrid searches (combining vector similarity with keyword filters or other criteria). Distance metric support includes cosine similarity (default), with dot product and Euclidean available through configuration.
- Integrations: Provides a powerful GraphQL API for queries, as well as a RESTful API. Official client SDKs exist for Python, JavaScript/TypeScript, Java, Go, .NET and more (Vector Databases Compared: Pinecone, Milvus, Chroma, Weaviate, FAISS, and more). Weaviate’s GraphQL query interface allows combining vector queries (nearVector/nearText clauses) with filters in a single request. It also integrates with machine learning modules – e.g. built-in vectorizers for text/images or the ability to bring your own embeddings.
- Features: Weaviate “combines object and vector storage, enabling efficient vector searches with structured filtering” (Weaviate vs Pinecone | Zilliz). This means you can store additional properties (structured data) alongside vectors and filter results by fields (e.g., date, category) in queries. It supports batch inserts, data updates, and deletion. Schemas are flexible, and you can have multiple classes (tables) each with their own vector index. It also supports multi-modal data (text, image, etc.) by using appropriate encoders. Security features include authentication, authorization, and data encryption in transit for the cloud service.
- Pricing: Free to use self-hosted (Apache 2.0 License). The managed Weaviate Cloud has a serverless pricing model based on the number of vector dimensions stored and the SLA tier chosen. For instance, paid plans start at roughly $25/month as a minimum charge, with usage billed at a rate on the order of $0.05 per million vector dimensions stored (Vector Database Pricing | Weaviate), and higher tiers for enterprise needs. A free sandbox tier is available for testing. This usage-based model means you pay for what you index, with cost scaling with data size (Vector Database Pricing | Weaviate).
ChromaDB
- Type: Open-source “AI-native” embedding database. Chroma (often called ChromaDB) is a lightweight vector store designed for LLM applications and easy integration. It’s available as a Python package (and via a REST server if self-hosted behind an API). There is no fully-managed service yet (one is in development), but it’s trivial to run anywhere (even embedded in a Python app) (Chroma).
- Scalability: Chroma is built for simplicity and developer-friendliness. It works well for local or small-scale deployments (e.g. thousands to millions of embeddings on a single machine). It can be deployed in a distributed way (by manually sharding data across multiple instances), but out-of-the-box it’s a single-node datastore. It stores data on disk (persistent) but keeps indexes in memory for speed. For very large scale (billions of vectors), other solutions might be more appropriate, but Chroma covers many LLM use cases where data sizes are more modest or can be segmented.
- Latency: Offers millisecond-level query latency for similarity searches on moderate-sized datasets. Since it uses an in-memory HNSW index, queries are fast. The trade-off is that extremely large datasets might exceed memory; for those, you’d either use partial indexes or a different system. For most applications with up to tens of millions of vectors, Chroma can retrieve nearest neighbors quickly.
- Indexing: Uses HNSW for approximate nearest neighbor search under the hood (Vector Databases Compared: Pinecone, Milvus, Chroma, Weaviate, FAISS, and more). Chroma supports cosine, dot product, or Euclidean distance metrics for similarity comparisons (Vector Databases Compared: Pinecone, Milvus, Chroma, Weaviate, FAISS, and more). It also has built-in support for full-text search on documents and metadata filtering, which differentiates it from pure vector-only stores. In essence, Chroma can index text both as vectors and as keywords, enabling hybrid queries (this “batteries included” approach means you can do vector search, lexical search, and metadata filtering all in one place) (Chroma).
- Integrations: Chroma is made to integrate seamlessly with LLM tooling. It’s the default vector store in some frameworks like LangChain. It provides simple APIs in Python (and experimental JavaScript bindings) for adding embeddings and querying. You can use it in-memory during an application’s runtime or launch a Chroma server. Because it’s Python-based, integration with PyTorch, Hugging Face transformers, or other ML pipelines is straightforward.
- Features: As the Chroma homepage states, “Chroma is the open-source AI application database. Embeddings, vector search, document storage, full-text search, metadata filtering, and multi-modal – all in one place.” (Chroma). You can store not just vectors but also the original content (text, etc.) and any metadata. Multi-modal support means you could store embeddings from text, images, etc., in the same DB. However, unlike some larger systems, Chroma does not support rich user-defined schemas – each collection is essentially a flat set of embeddings, documents, and metadata rather than a structured set of typed classes.
- Pricing: Free and open-source (Apache 2.0) for all uses (Chroma). Running Chroma yourself incurs only your infrastructure costs. A hosted Chroma service is not yet generally available (as of 2025, users can join a waitlist for a managed cloud offering). In short, Chroma’s cost is minimal – it aims to be the “fastest way to build LLM apps with memory” (chroma-core/chroma: the AI-native open-source embedding database) without introducing new commercial barriers.
FAISS
- Type: Library (not a server) for efficient similarity search, developed by Facebook AI (Meta). FAISS (Facebook AI Similarity Search) is a C++ library with Python bindings, widely used to implement vector search inside custom applications (GitHub - facebookresearch/faiss: A library for efficient similarity search and clustering of dense vectors.). It’s not a standalone database – rather, you embed it in your code or service. Many vector databases internally use FAISS components for indexing.
- Scalability: Extremely scalable on a single machine. FAISS can handle datasets of billions of vectors by using compression techniques and/or multi-shard setups (Faiss: A library for efficient similarity search - Engineering at Meta). For example, FAISS notes that storing 1 billion 128-d vectors (which is ~32 bytes each if optimized) in RAM is feasible with compression (Faiss: A library for efficient similarity search - Engineering at Meta). It also supports multi-GPU acceleration, meaning you can distribute an index across several GPUs to handle larger data and speed up queries (GitHub - facebookresearch/faiss: A library for efficient similarity search and clustering of dense vectors.). However, FAISS itself doesn’t handle clustering multiple machines – if you need a distributed solution across nodes, you have to build that logic (or use an existing wrapper that shards FAISS indexes).
- Latency: Known for high performance. FAISS offers sub-millisecond search times for nearest neighbors on million-scale datasets using approximate methods. It’s optimized in C++ and can use CPU SIMD instructions and GPUs to accelerate search. With the right index type (and parameter tuning for recall vs. speed), FAISS can answer queries far faster than brute-force search while maintaining high accuracy (Faiss: A library for efficient similarity search - Engineering at Meta) (GitHub - facebookresearch/faiss: A library for efficient similarity search and clustering of dense vectors.). Many benchmarks show FAISS as a top performer for ANN search on a single node.
- Indexing: Rich choice of indexing methods. FAISS provides multiple algorithms: flat indexes (exact search), IVF (Inverted File) indexes that partition the vector space, HNSW graphs, PQ (Product Quantization) and OPQ (Optimized PQ) for compression, IVF+PQ combinations, the IMI (inverted multi-index) for very large scale, etc. (GitHub - facebookresearch/faiss: A library for efficient similarity search and clustering of dense vectors.). Developers can choose an index type based on their needs (trade-off between speed, memory, and accuracy). For example, one can use IndexFlatL2 for small data (exact search), or an IndexIVFPQ for billions of vectors with limited memory. FAISS also supports a range of distance metrics: Euclidean (L2), inner product (IP), cosine (via normalized vectors), etc. (GitHub - facebookresearch/faiss: A library for efficient similarity search and clustering of dense vectors.). This flexibility makes FAISS a kind of building block for vector search.
- Integrations: Because FAISS is a lower-level library, integration means writing code to use it. Many Python developers use the faiss Python module to create and query indexes in-memory. Some vector DBs (like Milvus or Zilliz) have historically used FAISS for certain index types under the hood. If you use FAISS directly, you’ll need to manage persistence (saving the index to disk) and possibly handle concurrent queries (FAISS itself is thread-safe for searching, but you must ensure the index is in memory). It doesn’t offer an out-of-the-box REST API – you integrate it into your service’s API if needed.
- Features: Optimized for memory and speed – FAISS can even store indices in GPU memory for extremely fast retrieval (GitHub - facebookresearch/faiss: A library for efficient similarity search and clustering of dense vectors.). It includes utilities for training vectors (e.g., k-means for clustering centroids in IVF) and for evaluating search recall. One highlight: “Faiss focuses on methods that compress the original vectors, because they’re the only ones that scale to data sets of billions of vectors.” (Faiss: A library for efficient similarity search - Engineering at Meta) It can reduce vectors to byte codes (through PQ or binary hashing) to drastically lower memory usage, at some cost to precision. FAISS also supports adding and removing vectors (though dynamic updates on some index types may require reindexing or have performance costs).
- Pricing: Free (BSD-licensed open source). There is no commercial tier; you run it on your own hardware. The cost is essentially whatever compute resources you need (CPU/GPU and RAM for your indexes). Using FAISS might save costs compared to a managed service if you already have infrastructure, but engineering effort is required to maintain it in production (for example, handling failover, sharding, etc., if needed).
Vespa
- Type: Open-source search engine platform by Yahoo (now maintained by Vespa AI). Vespa is a mature system that supports both vector search and traditional keyword search, plus advanced features like filtering, aggregation, and custom ranking. It’s available open-source (Apache 2.0) and via Vespa Cloud (a managed service). Vespa has been used in large-scale applications for about two decades (originating from Yahoo’s search and recommendation systems) (Vector database feature comparison) (Vector database feature comparison), making it a proven solution at web scale.
- Scalability: Designed for massive scale and high concurrency. Vespa can serve very large indexes (billions of documents/vectors) across distributed clusters. It automatically manages data partitioning and replication. It supports online scaling (adding nodes to a running cluster to redistribute load) with minimal downtime. In fact, Vespa has “20 years of experience serving workloads involving AI and big data, online at large scale” (Vector database feature comparison), and is used in production for scenarios handling hundreds of thousands of queries per second.
- Latency: Optimized for real-time serving. Vespa was built to power user-facing search and recommendation features (e.g., Yahoo’s mail search, news feed personalization), so it emphasizes low latency query execution even under heavy load. It achieves this with efficient C++ components and smart query planning (e.g., skipping unnecessary computations, distributing queries in parallel to partitions). Even complex queries (combining text and vector similarity, with filters) are executed within tens of milliseconds in many cases.
- Indexing: Supports both exact and approximate vector search. Vespa implements HNSW for ANN indexing of vectors (configurable per vector field) (Approximate Nearest Neighbor Search using HNSW Index). You can choose to use approximate=true (HNSW) for faster results or approximate=false for exact brute-force search on smaller data. Uniquely, Vespa allows combining nearest-neighbor search with other query operators: you can require boolean conditions, do text matching, and simultaneously compute vector similarity in one query (Nearest Neighbor Search). This flexible query orchestration is a standout feature. It also supports multiple vector fields per document, each with its own index, and even multiple query vectors in a single query. In addition, Vespa has built-in support for BM25 text indexing, filtering on structured fields, and more – truly a hybrid engine (Vector database feature comparison) (Vector database feature comparison).
- Integrations: Vespa exposes a RESTful Search API and a Document API (for feeding data). You typically define an application package (schema files plus service configuration) and deploy it to a Vespa cluster. For developers, pyvespa (Python SDK) helps with deploying an application and querying it programmatically. Vespa isn’t as plug-and-play in most LLM frameworks out-of-the-box, but it can be used in retrieval pipelines for LLMs (there are examples integrating Vespa with LangChain and other RAG setups). Think of Vespa as an engine you run (or have managed in Vespa Cloud) and then query via HTTP/JSON. It supports an advanced query language (YQL) for crafting complex queries, which can be a learning curve but offers great power.
- Features: Hybrid search capabilities (combine keyword and vector search) are first-class. For example, you can ask Vespa to "find nearest neighbors to this embedding among documents that match a filter or keyword". It supports real-time updates (ingesting or updating documents on the fly, with immediate query availability) (Vector database feature comparison). It also has features like custom ranking (you can deploy a TensorFlow or XGBoost model inside Vespa to re-rank results), aggregation/faceting on result sets, and more (Vector database feature comparison). Vespa can serve as a one-stop solution for building semantic search with filtering and business rules. Security-wise, being self-hosted, it relies on network security and any custom auth you put in front (Vespa Cloud would handle encryption and access control for you).
- Pricing: Open-source and free to run on your own hardware. If using Vespa Cloud (managed), pricing is resource-based: you pay for the vCPU hours, GB of memory, storage, etc., that your deployment uses (Vespa Cloud Pricing) (Vespa Cloud Subscription - AWS Marketplace - Amazon.com). Vespa Cloud provides a $300 free credit for new users (Free trial - Vespa.ai). Because Vespa is quite powerful, running it at scale might involve more resources than a simpler vector DB, but it can replace a combination of systems (vector search + keyword search, etc.) in one. For experimentation, you can also use Vespa’s free trial or run a small Docker container locally at no cost.
Summary of Key Differences: In summary, Pinecone and Weaviate are popular choices tailored for ease of use with LLMs – Pinecone if you prefer a fully managed, plug-and-play service, and Weaviate if you want open-source flexibility or cloud with more control. ChromaDB is the developer-friendly open-source library great for prototyping and moderate-scale LLM apps. FAISS is the go-to for those needing maximal performance and customization on a single machine or who want to embed vector search logic into their own system. Vespa is a heavyweight solution offering hybrid search and enterprise-scale capabilities (vector + text), suitable when you need a powerful search engine underpinning your LLM application. The best choice depends on factors like scale, budget, integration needs, and whether you require advanced filtering or hybrid queries.
Vector Storage Use Cases in LLMs
Vector databases unlock several important capabilities for LLM-based applications. By storing embeddings (vector representations) of text or other data, they allow LLMs to retrieve relevant information efficiently and overcome some inherent limitations of the models. Key use cases include:
Retrieval-Augmented Generation (RAG)
Retrieval-Augmented Generation is a technique that combines an LLM with an external knowledge base. When a user asks a question, the system first retrieves relevant information (by embedding the query and doing a similarity search in a vector store of documents), and then feeds that information into the LLM to generate a more informed answer (Vector databases · Vectorize). The vector database is the core of this retrieval step – it enables semantic search over the knowledge base. By using embeddings, RAG can fetch information that is conceptually relevant to the query, even if the wording differs (unlike keyword search).
Using RAG has two big benefits for LLMs: it expands the effective knowledge of the model and reduces hallucinations. The LLM doesn’t have to rely solely on its trained parameters (which may be outdated or limited); it can pull in up-to-date facts or domain-specific data from the vector store. This grounding in real data means the LLM’s responses are more likely to be correct and traceable to a source (What is RAG: Understanding Retrieval-Augmented Generation) (Vector databases · Vectorize). For example, a company can feed all its internal documents or a product knowledge base into a vector DB. When a question comes, the system retrieves the top-n relevant chunks and prompts the LLM with those, so the answer will reference that external knowledge.
How it works: Typically, developers embed each document (or paragraph, or chunk of text) in the knowledge base into a high-dimensional vector and store those in the vector DB with the text as metadata. At query time, the user’s question is also embedded (using the same model), and the vector DB is queried (ANN search) to get, say, the top 5 closest document vectors. Those corresponding texts are fetched (the vector DB can store or link to the texts) and appended to the LLM’s prompt (often with a format like a QA context). The LLM then generates its answer using both the prompt question and the retrieved context. This process effectively gives the LLM a large external memory of facts to draw from each time (Vector databases · Vectorize).
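A minimal sketch of this flow is shown below, using Chroma as the vector store and a toy embed() helper as a stand-in for a real embedding model (the helper, collection name, and sample texts are all invented for illustration):
import hashlib
import chromadb
def embed(text):
    # Toy deterministic stand-in for a real embedding model (returns an 8-dim vector)
    return [b / 255.0 for b in hashlib.sha256(text.encode()).digest()[:8]]
client = chromadb.Client()
kb = client.create_collection(name="knowledge_base")
# Index the knowledge base once: one embedding per chunk of text
chunks = ["Our refund policy lasts 30 days.", "Support is available 24/7 via chat."]
kb.add(ids=[f"chunk-{i}" for i in range(len(chunks))],
       documents=chunks,
       embeddings=[embed(c) for c in chunks])
# At question time: embed the query, retrieve the top chunks, and build the prompt
question = "How long do I have to return a product?"
hits = kb.query(query_embeddings=[embed(question)], n_results=2)
context = "\n".join(hits["documents"][0])
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}\nAnswer:"
# 'prompt' is then sent to the LLM; the retrieved chunks ground its answer
In a real pipeline, embed() would call an actual embedding model, the chunks would come from a document splitter, and the assembled prompt would be passed to the LLM together with the user’s question.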
Real-world applications: RAG is used for chatbots that can cite knowledge, customer support assistants, open-domain QA systems, etc. For instance, Bing’s chat and other search-engine LLMs use a form of RAG – retrieving web pages by vector/keyword hybrid search and then answering questions. Internal enterprise QA bots use RAG to safely provide answers based on company documents. Developers often choose Pinecone, Weaviate, or Chroma for this use case, due to their ease of integration with frameworks like LangChain. The result is an LLM that is aware of an external knowledge source and can provide answers with references, dramatically improving reliability over vanilla LLM responses.
Memory Augmentation (Long-term Memory for LLMs)
LLMs have a context window limitation – they can only “remember” a certain amount of text from the conversation or instructions (maybe a few thousand tokens for GPT-3.5, up to 100k tokens in new models). Vector storage can serve as an external long-term memory for an LLM, allowing it to recall information from earlier in a conversation or from a user's history even if that data is no longer in the immediate context. In other words, the vector database can function as the LLM’s semantic memory: store embeddings of dialogue turns or documents, and fetch relevant pieces later when needed.
This is vital for building conversational agents that interact over long periods. For example, imagine a personal AI assistant that you’ve been talking to for weeks. It can’t fit all past dialogues into the prompt every time (that would be huge), but it can store vector embeddings of past interactions and retrieve the most relevant past facts when the user brings up a related topic. This technique allows continuity of conversation and personalization. Research has noted that vector databases can address LLMs’ lack of long-term memory by acting as an extension of the model’s memory (From prototype to production: Vector databases in generative AI applications - Stack Overflow). Essentially, whenever the conversation moves on, older dialogue turns are embedded and stored; when context is needed, the agent queries the vector store (with the current conversation state as query) to remember “what was previously said about this topic.”
Memory augmentation via vectors is not limited to dialogue. It can also mean an LLM agent that accumulates knowledge over time. For instance, an LLM-based research assistant could embed and store all the documents or articles it has read. Later, when asked a question, it can retrieve snippets from everything in its “memory” that might be relevant and use them to formulate an answer. This concept is sometimes called LLM memory or the “knowledge vault” approach.
Technical details: Implementing long-term memory often involves deciding how to chunk and embed conversational data. One common approach is to periodically summarize or chunk the conversation and store those summaries as embeddings (to reduce total entries). When retrieving, the system might embed the current conversation state or the last user query and look for similar vectors in the memory store (which would surface semantically related past conversations or notes) (Vector Storage Based Long-term Memory Research on LLM - Sciendo) (Vector Storage Based Long-term Memory Research on LLM - Sciendo). The retrieved items can then be fed into the prompt (similar to RAG) so the model can refer to them. Alternatively, some systems classify the conversation into topics and use vectors to fetch relevant facts on that topic that were previously mentioned.
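As a rough sketch of this store-and-recall loop (again using Chroma, with the same kind of toy deterministic embed() stand-in rather than a real embedding model; names and sample turns are invented):
import hashlib
import chromadb
def embed(text):
    # Toy deterministic stand-in for a real embedding model
    return [b / 255.0 for b in hashlib.sha256(text.encode()).digest()[:8]]
client = chromadb.Client()
memory = client.create_collection(name="chat_memory")
# After each exchange, store the turn (or a summary of it) with some metadata
turns = ["User mentioned their dog is named Biscuit.",
         "User said they prefer vegetarian recipes."]
memory.add(ids=[f"turn-{i}" for i in range(len(turns))],
           documents=turns,
           embeddings=[embed(t) for t in turns],
           metadatas=[{"kind": "summary"} for _ in turns])
# Later, recall the memories most relevant to the current message
current_message = "Can you suggest something my dog would like?"
recalled = memory.query(query_embeddings=[embed(current_message)], n_results=1)
print("Recalled:", recalled["documents"][0][0])
# The recalled snippet is prepended to the prompt so the model appears to "remember" it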
Vector databases are well-suited for this because they enable fast similarity search through potentially thousands of past interactions to find the one or two that are relevant to the current context. This use of a vector DB as associative memory helps maintain coherence in extended dialogues. As noted in a Stack Overflow article, “vector databases have shown that they can enhance LLM capabilities by acting as an external memory.” (From prototype to production: Vector databases in generative AI applications - Stack Overflow) When done correctly, the user gets the impression that the LLM “remembers” earlier details or user preferences even if those were from far earlier in the interaction.
Efficient Similarity Search and Semantic Indexing
At its core, a vector store’s primary capability is similarity search: given a query vector, find the most similar vectors (and thus similar items) in the database. Many LLM-driven applications need this kind of semantic search. Beyond the RAG and chat memory scenarios above, there are other use cases where you want to find data that semantically matches a query:
Semantic Search Engines: Replacing or augmenting keyword search with embedding-based search. For example, a documentation search feature where user queries and documents are vectorized so that the search can find conceptually related docs, not just exact keyword matches. This is useful in enterprise search or any situation where relevant information might use different wording. Vector DBs enable these natural language searches on your data. They “retrieve objects based on similarity” (rather than exact text matching) (From prototype to production: Vector databases in generative AI applications - Stack Overflow), so a search for “how to fix login issue” could find a knowledge base article that doesn’t contain those exact words but is about troubleshooting sign-in problems.
Recommendation Systems: If you embed users and products (or content) into vector space, you can use nearest neighbor search to recommend similar items. For instance, given an article a user liked (represented as an embedding), find other articles with nearby embeddings to recommend next. While not an LLM generating text, this is a common vector similarity use case. In LLM contexts, you might use this to suggest next prompts or relevant tools based on an embedding of the conversation state.
Clustering and Classification: You can classify data by finding which cluster centroid or labeled example is nearest in vector space. An LLM might produce an embedding of some input text and then use a vector DB to quickly find the closest label embedding (zero-shot classification via similarity); a short sketch of this appears after this list. This is faster than comparing against every possible label in real-time if the number of labels is large.
Anomaly or Novelty Detection: By periodically embedding new data (say, transactions, logs, etc.) and comparing to stored historical embeddings, you can detect if something is semantically different from past data (e.g., an error message that doesn’t resemble any seen before). This is a more niche use with LLMs but potentially useful for monitoring when LLMs are producing outputs far from the norm, etc.
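A short sketch of the classification-by-similarity idea from the list above, using plain NumPy with made-up label embeddings (in practice the labels and the input would be embedded by the same model, and a vector DB would do the nearest-neighbor lookup at scale):
import numpy as np
# Made-up label embeddings; in practice these come from embedding the label descriptions
label_embeddings = {
    "billing":   np.array([0.9, 0.1, 0.0]),
    "technical": np.array([0.1, 0.9, 0.2]),
    "shipping":  np.array([0.0, 0.2, 0.9]),
}
query_embedding = np.array([0.15, 0.85, 0.25])  # embedding of the incoming text
def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
# The nearest label embedding is the predicted class (zero-shot classification via similarity)
best_label = max(label_embeddings, key=lambda lbl: cosine(query_embedding, label_embeddings[lbl]))
print("Predicted label:", best_label)  # -> "technical"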
The common theme is that vector databases enable fast approximate $k$-nearest neighbor (ANN) search which is a fundamental building block in these applications. Doing a brute-force similarity comparison against millions of vectors for each query would be prohibitively slow, but with indexes like HNSW or IVF, a good vector DB can retrieve nearest neighbors in milliseconds. In fact, vector DBs are “optimized to conduct lightning-fast vector searches at scale” (From prototype to production: Vector databases in generative AI applications - Stack Overflow), making things like real-time semantic search possible in user-facing applications.
For example, Cloudflare’s developer docs note that without a vector index, you’d have to compare a query embedding to every data embedding (which is “neither practical nor efficient” when data is large) (Vector databases · Vectorize). Vector stores solve that by doing clever index lookups instead of raw comparisons. The result: queries that might have taken seconds or more with brute force can often run in under 50 milliseconds on a good ANN index – a necessity for interactive apps.
In LLM-based pipelines, efficient similarity search also means you can do things like tool retrieval (choose which tool or function an LLM should use based on embedding the conversation and available tool descriptions), or example retrieval for few-shot learning (find the most relevant examples to include in a prompt from a database of examples). Essentially, any time an LLM needs to pick from or look up information in a large collection, vector search is the go-to method.
To summarize, vector storage enables LLM systems to store knowledge, memory, and examples in embedding form, and retrieve them quickly by similarity. This powers RAG (enhancing generation with outside facts), gives long conversations continuity, and allows semantic searching and matching at scales that would be infeasible with naive approaches. The integration of a vector database is thus a key architectural choice in many advanced LLM applications, often determining how well the system can handle large knowledge and maintain performance.
Basic Code Examples
To solidify the understanding, here are simple code snippets demonstrating how to use some popular vector storage frameworks for storing and querying embeddings. These examples use Python APIs for FAISS, Pinecone, Weaviate, ChromaDB, and Vespa. (In practice, you would need to install the respective client libraries and have any required services running. The code is illustrative.)
Using FAISS for Vector Similarity Search (Python)
FAISS can be used to create an in-memory index of vectors and query it for nearest neighbors. In this example, we create a FAISS index for 128-dimensional vectors and perform a query for the 5 most similar vectors to a given query vector:
import numpy as np
import faiss # Make sure faiss is installed (e.g., pip install faiss-cpu)
# Create some sample data: 1000 vectors of dimension 128
dim = 128
data_vectors = np.random.random((1000, dim)).astype('float32')
query_vector = np.random.random((1, dim)).astype('float32')
# Build a FAISS index (Flat index for L2 distance)
index = faiss.IndexFlatL2(dim) # an exact index (brute-force)
# For larger scale, you could use IndexHNSWFlat or IndexIVFPQ for ANN.
print("Is trained?", index.is_trained) # True for flat index (no training needed)
index.add(data_vectors) # add vectors to the index
# Search the 5 nearest neighbors of the query vector
k = 5
distances, indices = index.search(query_vector, k)
print("Nearest neighbor indices:", indices[0])
print("Distances:", distances[0])
Explanation: We used IndexFlatL2, which computes exact Euclidean distances. FAISS also supports other indices like IndexHNSWFlat(dim, M) for HNSW or IndexIVFPQ (which requires a training step). After adding vectors, search returns the indices of the closest vectors and their distances. In a real scenario, you’d retrieve the actual data associated with those indices (which you’d store separately or as part of the vector payload).
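For larger collections, a trained, compressed index is more typical. The sketch below (not part of the original snippet; parameter choices are illustrative only) builds an IVF-PQ index, which must be trained before vectors can be added:
import numpy as np
import faiss
dim, nlist, m, nbits = 128, 100, 8, 8   # 100 IVF partitions; 8 sub-quantizers of 8 bits each
data = np.random.random((10000, dim)).astype('float32')
query = np.random.random((1, dim)).astype('float32')
quantizer = faiss.IndexFlatL2(dim)               # coarse quantizer assigning vectors to partitions
index = faiss.IndexIVFPQ(quantizer, dim, nlist, m, nbits)
index.train(data)    # IVF+PQ must be trained (k-means) before vectors can be added
index.add(data)
index.nprobe = 10    # number of partitions scanned per query (speed vs. recall trade-off)
distances, indices = index.search(query, 5)
print(indices[0], distances[0])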
Using Pinecone Vector Database (Python API)
Using Pinecone involves creating an index in the Pinecone service and then upserting and querying vectors. This example assumes you have a Pinecone account and an API key:
!pip install pinecone-client # install the Pinecone client library
import pinecone
# Initialize Pinecone (replace with your API key and environment)
pinecone.init(api_key="YOUR_API_KEY", environment="us-west1-gcp")
# 1. Create a new index (if not already created)
index_name = "example-index"
pinecone.create_index(name=index_name, dimension=128) # 128-dim vectors
# 2. Connect to the index
index = pinecone.Index(index_name)
# 3. Upsert vectors into the index
vectors_to_upsert = [
("vec1-id", [0.1, 0.2, 0.3, ... 0.128]), # 128-dimensional list
("vec2-id", [0.4, 0.2, 0.1, ... 0.05]) # (using dummy values for illustration)
]
index.upsert(vectors=vectors_to_upsert)
# 4. Query the index with a new vector
query_vector = [0.1, 0.2, 0.25, ... 0.0] # a 128-dim query embedding
result = index.query(vector=query_vector, top_k=3, include_metadata=True)
print(result)
Explanation: After initialization, we created an index named "example-index" with vector dimension 128. We then connected to that index and upserted two vectors (each with an ID). Finally, we queried the index for the top 3 nearest neighbors to a query vector. The result will typically include the IDs of the nearest vectors (and any metadata if stored). Pinecone manages the index behind the scenes – if the index already exists, the create_index call can be skipped or will error out. You can also attach metadata and use Pinecone’s filtering in the query (e.g., filter={"genre": "tech"} to only search a subset). Note that ... in the vector lists is just illustrative; in actual code you’d have a full list of 128 floats. Also note that this snippet uses the older pinecone-client (v2) interface; current versions of the Pinecone Python SDK replace pinecone.init() with a Pinecone(api_key=...) client object.
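As a follow-on sketch using the same legacy-client index as above (the metadata fields and values are invented for illustration), metadata can be attached at upsert time and used to restrict queries with Pinecone’s MongoDB-like filter operators:
# Upsert vectors with metadata attached: (id, values, metadata) tuples
index.upsert(vectors=[
    ("doc-1", [0.1] * 128, {"genre": "tech", "year": 2023}),
    ("doc-2", [0.2] * 128, {"genre": "finance", "year": 2021}),
])
# Query restricted to a metadata subset via Pinecone's filter operators
result = index.query(
    vector=[0.1] * 128,
    top_k=3,
    filter={"genre": {"$eq": "tech"}, "year": {"$gte": 2022}},
    include_metadata=True,
)
print(result)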
Using Weaviate (Python client with REST/GraphQL)
Weaviate can be self-hosted (e.g., via Docker) or accessed through Weaviate Cloud. The Python client wraps Weaviate’s REST/GraphQL API. In this example, we create a schema, add a data object with an embedding, and perform a vector search using the near_vector
filter:
!pip install weaviate-client
import weaviate
import numpy as np
# Connect to Weaviate (assuming a local instance running on port 8080)
client = weaviate.Client("http://localhost:8080")
# 1. Define a schema (class) for our data if not already present
schema = {
"classes": [
{
"class": "Document", # class name
"properties": [
{ "name": "content", "dataType": ["text"] }
]
# (Weaviate will automatically have a vector space for this class)
}
]
}
# Create the schema (if the class doesn't exist yet)
client.schema.create(schema)
# 2. Add a document object with an embedding vector
doc_properties = { "content": "Hello world" }
vector = np.random.random(512).tolist() # example 512-dim embedding
client.data_object.create(
data_object=doc_properties,
class_name="Document",
vector=vector,
uuid="doc1"
)
# 3. Query the nearest neighbor to a given vector
query_vector = np.random.random(512).tolist()
result = client.query.get("Document", ["content", "_additional { vector, distance }"]) \
.with_near_vector({"vector": query_vector}) \
.with_limit(1).do()
print(result)
Explanation: We defined a Document class with a content field. When we add objects to Weaviate, we can supply our own vector (since we did not configure a vectorizer module for the class, Weaviate expects us to provide the embedding). We added one document with a random 512-dim vector for demonstration. The query uses GraphQL Get { Document (...) } under the hood. We specified _additional { vector, distance } to return the stored vector and distance for demonstration, but typically you might just request content or other metadata. The with_near_vector({"vector": query_vector}) part is the key: it tells Weaviate to perform a vector similarity search using the provided query vector. The result will contain the nearest Document (in this case, essentially the one we inserted, since we only had one) and the distance. In practice, you’d have many documents and retrieve the top K most similar; you can also use .with_where(filter) to apply filters alongside the vector search, and .with_near_text({"concepts": ["..."]}) if you want Weaviate to handle embedding the query text via an installed module.
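As a small follow-on sketch using the same v3-style client and Document class (the filter pattern is invented for illustration), a where filter can be combined with the vector search in one request:
# Combine the vector search with a structured filter in one request
where_filter = {
    "path": ["content"],
    "operator": "Like",
    "valueText": "*world*"   # only consider Documents whose content matches this pattern
}
result = client.query.get("Document", ["content"]) \
    .with_near_vector({"vector": query_vector}) \
    .with_where(where_filter) \
    .with_limit(3) \
    .do()
print(result)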
Using ChromaDB (Python)
ChromaDB is very straightforward to use. You create a client, create a collection, add embeddings with optional IDs and documents, then query by embedding or by text. Here’s a basic example:
!pip install chromadb
import chromadb
# 1. Create a Chroma client and collection
client = chromadb.Client()
collection = client.create_collection(name="documents")
# 2. Add documents with embeddings to the collection
embeddings = [
[0.1, 0.2, 0.3], # embedding for doc1 (3-dimensional for simplicity)
[0.12, 0.18, 0.24] # embedding for doc2
]
documents = ["This is a test document.", "Another example document."]
ids = ["doc1", "doc2"]
collection.add(documents=documents, embeddings=embeddings, ids=ids)
# 3. Query the collection for similar items to a query embedding
query_emb = [[0.11, 0.2, 0.25]] # a query vector
results = collection.query(query_embeddings=query_emb, n_results=1, include=['documents', 'distances'])
print("Closest document:", results['documents'][0][0])
print("Distance:", results['distances'][0][0])
Explanation: We created a collection named "documents". We then added two documents, each with a 3-dimensional embedding (in real use, embeddings would be high-dimensional, e.g., 384 or 768 dims from an embedding model). We provide IDs for each vector as well. The add operation in Chroma stores the vectors, their IDs, the raw documents, and any metadata (here we didn’t specify extra metadata, but we could). For the query, we used query_embeddings directly with a sample vector and asked for the closest match (n_results=1). We also asked to include the actual document text and distance in the result. Chroma will return a structure containing the IDs, documents, and distances of the top matches. In this synthetic example, it should return one of the documents we added as the closest. Chroma can also accept query_texts=["..."] if an embedding function is set for the collection – it will then embed the text query internally before searching. Note that by default Chroma measures (squared) L2 distance on the stored embeddings; cosine similarity can be selected per collection (e.g., via the hnsw:space setting).
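As a follow-on sketch using the same collection (the metadata values are invented for illustration), metadata can be stored alongside vectors and used as a where filter at query time:
# Store metadata alongside vectors, then filter on it at query time
collection.add(
    ids=["doc3"],
    documents=["A third document about vector databases."],
    embeddings=[[0.3, 0.1, 0.2]],
    metadatas=[{"source": "notes"}],
)
results = collection.query(
    query_embeddings=[[0.11, 0.2, 0.25]],
    n_results=2,
    where={"source": "notes"},   # only consider items whose metadata matches
    include=["documents", "metadatas", "distances"],
)
print(results["documents"][0])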
Using Vespa (via REST API)
Using Vespa involves more setup (defining a schema and deploying the Vespa application). Here, we'll outline a simple example of feeding a document and querying using Vespa’s HTTP API. (This assumes you have a Vespa instance running with a schema that has a vector field.)
import requests
import json
# Assume a Vespa schema "MyDoc" with a field "embedding" of type tensor<float>(d[3]),
# an HNSW index on that field, and a rank profile defining the query tensor query_vec
vespa_endpoint = "http://localhost:8080" # endpoint for Vespa
# 1. Feed a document with an embedding
doc_id = "doc1"
document = {
"fields": {
"embedding": [0.1, 0.2, 0.3],
"text": "Sample content for vector search"
}
}
feed_url = f"{vespa_endpoint}/document/v1/MyDoc/docid/{doc_id}"
resp = requests.post(feed_url, json=document)
print("Feed response:", resp.status_code)
# 2. Perform a nearest neighbor search query for a given query vector
query_vector = "[0.1,0.2,0.25]" # Vespa expects the vector as a string in query params
yql = "select * from MyDoc where ([{targetHits:5}]nearestNeighbor(embedding, query_vec));"
params = {
"yql": yql,
"input.query_vec": query_vector
}
search_url = f"{vespa_endpoint}/search/"
resp = requests.get(search_url, params=params)
result_json = resp.json()
matches = result_json.get("root", {}).get("children", [])
for match in matches:
score = match.get("relevance")
doc_fields = match.get("fields", {})
print(f"Match id={doc_fields.get('id')} score={score} text={doc_fields.get('text')}")
Explanation: In Vespa, we construct a query using YQL (Yahoo Query Language). The clause nearestNeighbor(embedding, query_vec) means: use the vector field embedding in documents and the query tensor query_vec to find nearest neighbors. The {targetHits: 5} annotation asks for the top 5 results. We pass the query vector via the input.query(query_vec) request parameter, which Vespa binds to the query tensor. The response contains matched documents with a relevance score (which for nearest-neighbor search is typically based on similarity/closeness). We loop through the results and print out the document ID, score, and text. Before querying, we fed a document doc1 with a sample 3-dimensional embedding. In practice, you would deploy a Vespa application with the proper schema (defining that embedding is a tensor of a certain dimension with an HNSW index, plus a rank profile that declares the query_vec tensor and ranks by closeness). The example is a simplification to show the API calls. Vespa also lets you combine this vector clause with other conditions in the same query (e.g., adding and text contains "sample" to the where clause for hybrid search).
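As a follow-on sketch reusing the endpoint and schema assumptions from the snippet above, a hybrid query simply adds a keyword condition to the same YQL statement:
# Hybrid query: a keyword condition AND the nearest-neighbor clause in one YQL statement
hybrid_yql = ('select * from MyDoc where text contains "sample" '
              'and ({targetHits: 5}nearestNeighbor(embedding, query_vec))')
params = {
    "yql": hybrid_yql,
    "input.query(query_vec)": "[0.1,0.2,0.25]",
}
resp = requests.get(f"{vespa_endpoint}/search/", params=params)
print(resp.json().get("root", {}).get("fields", {}).get("totalCount"))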
Each of these code snippets demonstrates the basic operations: inserting vectors and querying for nearest neighbors. In real LLM applications, these would be integrated such that when a query comes in, it’s embedded (via an encoder model), passed to the vector store which returns IDs or contents, and those are then used by the LLM. Likewise, adding data might be part of a pipeline that processes and embeds documents or conversation turns and stores them.
These frameworks offer more advanced usage (like filtering results by metadata in Pinecone/Weaviate/Vespa, doing batch upserts, adjusting index parameters for performance, etc.), but the above examples cover the core pattern of vector storage and retrieval that underpins LLM use cases like RAG and memory augmentation. Each solution has its own API nuances, but conceptually they all provide a way to add(vector, id, metadata) and query(similar to this vector) – the building blocks for making our language models smarter and more capable by coupling them with vectorized knowledge.
Prepared with OpenAI o1-pro & deep-research