πŸ—„οΈ Database✍️ KhoaπŸ“… 19/04/2026β˜• 17 phΓΊt đọc

Vector Databases: Tα»« Zero Δ‘αΊΏn Senior+

Vector databases are the "silent heroes" behind every AI application you use daily: ChatGPT search, Spotify recommendations, Google Photos face recognition. If you are building AI-powered apps, this is must-have knowledge.

🎯 Analogy: A traditional DB is like looking up a dictionary (exact match). A vector DB is like asking "give me words with a similar meaning" β†’ semantic search.


Level 1: Foundations (New β†’ Junior)

What is a Vector?

Vector = Array of numbers representing meaning/features.

# Text example
"dog" β†’ [0.2, 0.8, 0.1, 0.5, ...]  # 768 dimensions
"puppy" β†’ [0.25, 0.75, 0.15, 0.48, ...]  # Similar to "dog"
"car" β†’ [0.9, 0.1, 0.8, 0.2, ...]  # Different from "dog"

# Image example
πŸ• (image) β†’ [0.34, 0.12, 0.89, ...]
πŸš— (image) β†’ [0.91, 0.05, 0.23, ...]

Key insight: Vectors capture semantic meaning β€” similar concepts have similar vectors.

Embeddings: From Data to Vectors

Embedding = the process of converting data (text, images, audio) into a vector.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

# Embed text
text = "I love machine learning"
embedding = model.encode(text)
print(embedding.shape)  # (384,) - 384-dimensional vector
print(embedding[:5])    # [0.0234, -0.1234, 0.5678, ...]

Popular embedding models:

  • Text: OpenAI text-embedding-3-small, Cohere Embed, sentence-transformers
  • Images: CLIP, ResNet, ViT
  • Code: CodeBERT, GraphCodeBERT
  • Multimodal: CLIP (text + images)

Similarity Search: Core Operation

Problem: "Find the 10 items most similar to this query"

import numpy as np

# 3 documents in our "database"
docs = {
    "doc1": np.array([0.2, 0.8, 0.1]),  # "I love dogs"
    "doc2": np.array([0.25, 0.75, 0.15]),  # "Puppies are cute"
    "doc3": np.array([0.9, 0.1, 0.8]),  # "Cars are fast"
}

query = np.array([0.22, 0.78, 0.12])  # "Dogs are amazing"

# Calculate similarity
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

similarities = {
    doc_id: cosine_similarity(query, vec)
    for doc_id, vec in docs.items()
}

print(similarities)
# {'doc1': 0.999, 'doc2': 0.998, 'doc3': 0.376}
# β†’ doc1 and doc2 are most similar to the query!

Distance Metrics

Three common metrics:

1. Cosine Similarity (most common):

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
# Range: -1 (opposite) to 1 (identical)
# 0 = orthogonal (unrelated)

2. Euclidean Distance (L2):

def euclidean_distance(a, b):
    return np.linalg.norm(a - b)
# Smaller = more similar

3. Dot Product:

def dot_product(a, b):
    return np.dot(a, b)
# Higher = more similar (for normalized vectors)

When to use which?

  • Cosine: Text embeddings (direction matters, not magnitude)
  • Euclidean: Image embeddings (absolute position matters)
  • Dot Product: When vectors are already normalized
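A quick sanity check of these rules: for two vectors pointing the same way but with different magnitudes, cosine calls them identical while Euclidean does not, and after L2 normalization the dot product agrees with cosine:

```python
import numpy as np

a = np.array([1.0, 2.0, 2.0])
b = 2 * a  # same direction, twice the magnitude

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine(a, b))           # 1.0  (direction identical, magnitude ignored)
print(np.linalg.norm(a - b))  # 3.0  (Euclidean sees the magnitude gap)

# After L2 normalization, dot product equals cosine similarity
a_n = a / np.linalg.norm(a)
b_n = b / np.linalg.norm(b)
print(np.dot(a_n, b_n))       # ~1.0
```

This is why text embeddings (where only direction carries meaning) default to cosine, while unnormalized feature vectors often use Euclidean.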

Real-world Use Cases

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  1. Semantic Search                              β”‚
β”‚  User: "how to train a neural network"           β”‚
β”‚  β†’ Find similar docs (not just keyword match)    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  2. Recommendation Systems                       β”‚
β”‚  User liked movie A                              β”‚
β”‚  β†’ Find movies with similar embeddings           β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  3. RAG (Retrieval Augmented Generation)         β”‚
β”‚  ChatGPT question β†’ Find relevant docs           β”‚
β”‚  β†’ Feed to LLM as context                        β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  4. Image Search                                 β”‚
β”‚  Upload image β†’ Find similar images              β”‚
β”‚  (Google Photos, Pinterest)                      β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  5. Anomaly Detection                            β”‚
β”‚  Normal transactions: cluster tightly            β”‚
β”‚  Fraud: far from cluster β†’ detect!               β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
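The anomaly-detection idea above can be sketched as a distance-to-centroid rule (toy data and threshold are made up for the demo):

```python
import numpy as np

# Embeddings of known-normal transactions (toy 2D data for illustration)
normal = np.array([[0.10, 0.20], [0.12, 0.19], [0.09, 0.21]])
centroid = normal.mean(axis=0)

# Threshold: e.g. 3x the mean distance of normal points to the centroid
threshold = 3 * np.linalg.norm(normal - centroid, axis=1).mean()

def is_anomaly(vec):
    # Far from the cluster of normal behavior -> flag it
    return np.linalg.norm(vec - centroid) > threshold

print(is_anomaly(np.array([0.11, 0.20])))  # inside the cluster -> False
print(is_anomaly(np.array([0.90, 0.95])))  # far away -> True
```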

First Vector DB: In-memory with NumPy

class SimpleVectorDB:
    def __init__(self):
        self.vectors = []
        self.metadata = []
    
    def insert(self, vector, metadata):
        self.vectors.append(vector)
        self.metadata.append(metadata)
    
    def search(self, query_vector, top_k=5):
        # Brute force: compare against every vector
        similarities = []
        for i, vec in enumerate(self.vectors):
            sim = cosine_similarity(query_vector, vec)
            similarities.append((sim, i))
        
        # Sort by similarity (descending)
        similarities.sort(reverse=True)
        
        # Return top K
        results = []
        for sim, idx in similarities[:top_k]:
            results.append({
                'similarity': sim,
                'metadata': self.metadata[idx]
            })
        return results

# Usage
db = SimpleVectorDB()

# Insert documents
model = SentenceTransformer('all-MiniLM-L6-v2')

docs = [
    "The cat sits on the mat",
    "Dogs are loyal animals",
    "Machine learning is fascinating",
]

for doc in docs:
    embedding = model.encode(doc)
    db.insert(embedding, {'text': doc})

# Search
query = "Tell me about artificial intelligence"
query_embedding = model.encode(query)
results = db.search(query_embedding, top_k=2)

for r in results:
    print(f"Similarity: {r['similarity']:.3f}")
    print(f"Text: {r['metadata']['text']}\n")

The problem with this approach?
β†’ O(N) complexity: every query compares against ALL vectors β€” slow once you have millions of them!


Level 2: Vector Databases & Indexing (Mid-level)

Why do we need a Vector Database?

In-memory NumPy:
βœ… Simple
βœ… Good for <10K vectors
❌ Slow (O(N) search)
❌ No persistence
❌ No concurrent access
❌ Doesn't scale

Vector Database:
βœ… Fast search (O(log N) or better)
βœ… Persistent storage
βœ… Horizontal scaling
βœ… Production-ready (transactions, backups, monitoring)
βœ… Approximate search (trade accuracy for speed)

Database       Type           Best For                      Language
─────────────────────────────────────────────────────────────────────
Pinecone       Cloud-native   Managed service, easy setup   Any (REST API)
Weaviate       Full-featured  Hybrid search, multi-tenancy  Go
Qdrant         Modern         Performance, Rust-powered     Rust
Milvus         Enterprise     Large scale, Kubernetes       C++/Python/Go
Chroma         Embedded       Development, prototyping      Python
pgvector       Extension      Existing Postgres users       SQL
Elasticsearch  Search engine  Already using ES              Java

ANN: Approximate Nearest Neighbor

Trade-off: 100% accuracy vs speed

Exact search (brute force):
- Compare against all N vectors
- 100% accuracy
- O(N) complexity
- Slow: 1M vectors = 1M comparisons

Approximate search (ANN):
- Use index structure (tree, graph)
- ~95-99% accuracy (configurable)
- O(log N) or O(1) complexity
- Fast: 1M vectors = ~20 comparisons

Key insight: In most use cases, 99% accuracy is enough (users can't tell the difference).

Indexing Algorithms

1. HNSW (Hierarchical Navigable Small World)

Graph-based: vectors are nodes, edges connect similar vectors.

Level 2:  πŸ”΄ ←→ πŸ”΄ ←→ πŸ”΄  (sparse, long jumps)
          ↓      ↓      ↓
Level 1:  πŸ”΅β†’πŸ”΅β†’πŸ”΅β†’πŸ”΅β†’πŸ”΅  (medium density)
          ↓  ↓  ↓  ↓  ↓
Level 0:  🟒🟒🟒🟒🟒🟒🟒🟒  (dense, all vectors)

Search:
1. Start at top level (long jumps)
2. Navigate to closest node
3. Drop to next level
4. Repeat until bottom
5. Refine search at bottom level

Characteristics:

  • βœ… Very fast search (O(log N))
  • βœ… High recall (accuracy)
  • ❌ Slow build time
  • ❌ Memory-intensive (graph in RAM)

Good for: Real-time search, when RAM is available

2. IVF (Inverted File Index)

Clustering-based: Partition vectors into clusters.

1. Build phase:
   - K-means clustering β†’ 1000 clusters
   - Each vector assigned to nearest cluster

2. Search phase:
   - Find query's nearest cluster(s)
   - Search only within those clusters
   - Only check 1/1000 of data!

Example:
Cluster 1: [dogs, puppies, pets, ...]
Cluster 2: [AI, ML, neural nets, ...]
Cluster 3: [cars, vehicles, ...]

Query: "machine learning" β†’ Search Cluster 2 only

Characteristics:

  • βœ… Fast search (O(log N))
  • βœ… Less memory than HNSW
  • ❌ Lower recall (might miss edge cases)
  • βœ… Fast build time

Good for: Large datasets, limited RAM

3. Product Quantization (PQ)

Compression: Reduce vector size β†’ fit more in RAM.

Original vector (768D, 32-bit float):
[0.234, -0.123, 0.567, ..., 0.891]
Size: 768 * 4 bytes = 3KB

After PQ compression:
[23, 145, 89, ...]  (codebook indices)
Size: 768 / 8 * 1 byte = 96 bytes
β†’ 32x compression!

Trade-off: Smaller size vs accuracy

Good for: Billions of vectors, RAM constraints

4. Hybrid Approaches

Most production systems combine algorithms:

Pinecone: IVF + PQ
Weaviate: HNSW + PQ (optional)
Milvus: IVF_FLAT, IVF_SQ8, HNSW, etc. (configurable)

Implementation: Pinecone (Managed Service)

import pinecone
from sentence_transformers import SentenceTransformer

# Initialize (classic pinecone-client v2 API; newer clients use Pinecone(api_key=...))
pinecone.init(api_key="your-api-key", environment="us-east1-gcp")

# Create index
pinecone.create_index(
    name="semantic-search",
    dimension=384,  # Model's output dimension
    metric="cosine",
    pod_type="p1.x1"  # Performance tier
)

index = pinecone.Index("semantic-search")

# Embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Insert vectors
docs = [
    {"id": "doc1", "text": "Machine learning basics"},
    {"id": "doc2", "text": "Neural networks explained"},
    {"id": "doc3", "text": "Cooking recipes for beginners"},
]

vectors_to_upsert = []
for doc in docs:
    embedding = model.encode(doc['text']).tolist()
    vectors_to_upsert.append({
        "id": doc['id'],
        "values": embedding,
        "metadata": {"text": doc['text']}
    })

index.upsert(vectors=vectors_to_upsert)

# Search
query = "Tell me about AI"
query_embedding = model.encode(query).tolist()

results = index.query(
    vector=query_embedding,
    top_k=3,
    include_metadata=True
)

for match in results['matches']:
    print(f"Score: {match['score']:.3f}")
    print(f"Text: {match['metadata']['text']}\n")

Implementation: Weaviate (Open-source)

import weaviate
from weaviate.classes.config import Configure, Property, DataType

# Connect
client = weaviate.connect_to_local()

# Create collection with vectorizer (v4 client uses Property objects)
client.collections.create(
    name="Document",
    vectorizer_config=Configure.Vectorizer.text2vec_openai(),
    properties=[
        Property(name="title", data_type=DataType.TEXT),
        Property(name="content", data_type=DataType.TEXT),
    ]
)

collection = client.collections.get("Document")

# Insert (auto-vectorization)
collection.data.insert_many([
    {"title": "AI Intro", "content": "Machine learning basics"},
    {"title": "Cooking 101", "content": "How to make pasta"},
])

# Search (auto-vectorize query)
results = collection.query.near_text(
    query="Tell me about artificial intelligence",
    limit=2
)

for item in results.objects:
    print(f"{item.properties['title']}: {item.properties['content']}")

Implementation: pgvector (PostgreSQL Extension)

-- Install extension
CREATE EXTENSION vector;

-- Create table with vector column
CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    content TEXT,
    embedding VECTOR(384)  -- 384 dimensions
);

-- Create index (IVF)
CREATE INDEX ON documents 
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);  -- 100 clusters

-- Insert
INSERT INTO documents (content, embedding) 
VALUES ('Machine learning basics', '[0.1, 0.2, 0.3, ...]');

-- Search
SELECT content, 1 - (embedding <=> '[0.12, 0.21, ...]') AS similarity
FROM documents
ORDER BY embedding <=> '[0.12, 0.21, ...]'
LIMIT 5;

Pros cα»§a pgvector:

  • βœ… Existing Postgres infrastructure
  • βœ… ACID transactions
  • βœ… Join with relational data
  • βœ… Familiar SQL

Cons:

  • ❌ Slower than specialized vector DBs
  • ❌ Limited indexing options
  • ❌ Harder to scale horizontally

Level 3: Production & Advanced Patterns (Senior)

Hybrid Search: Keyword + Vector

Problem: Pure vector search struggles with exact matches (product IDs, names).

Solution: Combine keyword search (BM25) + vector search.

# Weaviate hybrid search
results = collection.query.hybrid(
    query="iPhone 15",
    alpha=0.5,  # 0=keyword only, 1=vector only, 0.5=balanced
    limit=10
)

How it works:

1. Keyword search (BM25):
   "iPhone 15" β†’ High score for exact matches

2. Vector search:
   "iPhone 15" β†’ High score for semantic matches
   (e.g., "Apple's latest smartphone")

3. Combine scores:
   final_score = alpha * vector_score + (1 - alpha) * keyword_score
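The score combination in step 3 can be sketched in plain Python. Min-max normalization is one common way to make BM25 and cosine scores comparable; real engines may use rank fusion instead:

```python
def min_max(scores):
    # Rescale a {doc_id: score} dict into [0, 1]
    lo, hi = min(scores.values()), max(scores.values())
    return {k: (v - lo) / (hi - lo) if hi > lo else 0.0 for k, v in scores.items()}

def hybrid_scores(keyword_scores, vector_scores, alpha=0.5):
    kw, vec = min_max(keyword_scores), min_max(vector_scores)
    docs = set(kw) | set(vec)
    # A doc missing from one ranking contributes 0 from that side
    return {d: alpha * vec.get(d, 0.0) + (1 - alpha) * kw.get(d, 0.0) for d in docs}

kw = {"doc1": 12.3, "doc2": 3.1}    # BM25: doc1 has the exact "iPhone 15" match
vec = {"doc2": 0.91, "doc3": 0.88}  # vector: doc2 is semantically closest
print(sorted(hybrid_scores(kw, vec).items(), key=lambda x: -x[1]))
```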

Filtering with Metadata

Challenge: "Find similar docs, but only from 2024, in English, tagged 'tech'"

# Pinecone
results = index.query(
    vector=query_embedding,
    top_k=10,
    filter={
        "year": {"$eq": 2024},
        "language": {"$eq": "en"},
        "tags": {"$in": ["tech"]}
    }
)

# Weaviate (v4 client)
from weaviate.classes.query import Filter

results = collection.query.near_vector(
    near_vector=query_embedding,
    limit=10,
    filters=Filter.by_property("year").equal(2024)
)

Architecture pattern:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Pre-filtering (before vector search)β”‚
β”‚  Filter by metadata first            β”‚
β”‚  β†’ Smaller candidate set             β”‚
β”‚  β†’ Faster vector search              β”‚
β”‚  ❌ Limited by index structure       β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Post-filtering (after vector search)β”‚
β”‚  Vector search first                 β”‚  
β”‚  β†’ Filter results                    β”‚
β”‚  βœ… More flexible                    β”‚
β”‚  ❌ Might miss results if top_k low  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Best practice: Use pre-filtering when possible (faster).
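A post-filtering sketch with the usual mitigation, over-fetching before filtering (the index_query callable and the overfetch factor are assumptions for the demo):

```python
def search_with_post_filter(index_query, query_vec, predicate, top_k=10, overfetch=5):
    # Ask for more candidates than needed, since some will be filtered out
    candidates = index_query(query_vec, top_k * overfetch)
    kept = [c for c in candidates if predicate(c["metadata"])]
    return kept[:top_k]

# Demo with a fake in-memory index_query: 50 docs, alternating years
fake_results = [{"id": i, "metadata": {"year": 2023 + i % 2}} for i in range(50)]
results = search_with_post_filter(
    lambda q, k: fake_results[:k],
    query_vec=None,
    predicate=lambda m: m["year"] == 2024,
    top_k=10,
)
print(len(results))  # 10
```

If the filter is very selective, even a large overfetch can come back short β€” which is exactly why pre-filtering is preferred when the index supports it.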

Multi-vector / Late Interaction

Problem: Single vector per document loses nuance.

Solution: Multiple vectors per document.

# ColBERT approach: Token-level embeddings
document = "Machine learning is a subset of AI"
tokens = ["Machine", "learning", "is", "a", "subset", "of", "AI"]

# Each token gets embedding
token_embeddings = [
    model.encode(token) for token in tokens
]  # 7 vectors for 1 document

# Search: Compare query tokens with document tokens
query = "What is ML?"
query_tokens = ["What", "is", "ML"]
query_embeddings = [model.encode(t) for t in query_tokens]

# Max similarity for each query token
score = sum([
    max([cosine_similarity(q_emb, d_emb) for d_emb in token_embeddings])
    for q_emb in query_embeddings
])

Use case: Long documents, QA systems.

Chunking Strategies

Problem: Embedding models have a maximum input length (512 tokens for many models).

Strategy 1: Fixed-size chunks

def chunk_text(text, chunk_size=512, overlap=50):
    # Note: splits on words as a rough proxy for tokens
    words = text.split()
    chunks = []
    for i in range(0, len(words), chunk_size - overlap):
        chunk = ' '.join(words[i:i + chunk_size])
        chunks.append(chunk)
    return chunks

# Pro: Simple
# Con: Might split sentences/paragraphs awkwardly

Strategy 2: Semantic chunking

def semantic_chunk(text):
    paragraphs = text.split('\n\n')
    chunks = []
    current_chunk = []
    current_length = 0
    
    for para in paragraphs:
        para_len = len(para.split())
        if current_length + para_len > 512:
            chunks.append(' '.join(current_chunk))
            current_chunk = [para]
            current_length = para_len
        else:
            current_chunk.append(para)
            current_length += para_len
    
    if current_chunk:
        chunks.append(' '.join(current_chunk))
    
    return chunks

# Pro: Preserves semantic boundaries
# Con: Variable chunk sizes

Strategy 3: Sliding window with parent-child

# Store both:
# - Small chunks (for precise retrieval)
# - Large parent context (for LLM)

chunks = [
    {"chunk": "ML is a subset of AI", "parent_id": "doc1"},
    {"chunk": "Neural networks are...", "parent_id": "doc1"},
]

# Search on chunks, return parent doc for context
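A minimal sketch of the chunk-hit to parent-doc lookup (plain dicts stand in for the real stores):

```python
# Parent store: full documents, keyed by id
parents = {
    "doc1": "Full article covering ML, neural networks, and more ...",
}
# Chunk store: what the vector index actually searches over
chunks = [
    {"chunk": "ML is a subset of AI", "parent_id": "doc1"},
    {"chunk": "Neural networks are...", "parent_id": "doc1"},
]

def retrieve_context(chunk_hits, parents):
    # Deduplicate parents so the LLM doesn't see the same document twice
    seen, context = set(), []
    for hit in chunk_hits:
        pid = hit["parent_id"]
        if pid not in seen:
            seen.add(pid)
            context.append(parents[pid])
    return context

print(retrieve_context(chunks, parents))  # one parent doc, not two raw chunks
```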

Batch Operations & Indexing

# Bad: One at a time
for doc in documents:
    embedding = model.encode(doc)
    index.upsert([{"id": doc['id'], "values": embedding}])
# Slow: Many network calls

# Good: Batch upsert
batch_size = 100
for i in range(0, len(documents), batch_size):
    batch = documents[i:i+batch_size]
    embeddings = model.encode([d['text'] for d in batch])
    
    vectors = [
        {"id": d['id'], "values": emb.tolist()}
        for d, emb in zip(batch, embeddings)
    ]
    
    index.upsert(vectors=vectors)
# 100x faster

Monitoring & Observability

Key metrics:

# 1. Query latency
import time

start = time.time()
results = index.query(query_vector, top_k=10)
latency = time.time() - start

# Target: <50ms for p99

# 2. Recall (accuracy)
# Compare ANN results vs exact search
def measure_recall(query_vector, k=10):
    # ANN search
    ann_results = index.query(query_vector, top_k=k)
    ann_ids = {r['id'] for r in ann_results['matches']}
    
    # Exact search (brute force)
    exact_results = exact_search(query_vector, k)
    exact_ids = {r['id'] for r in exact_results}
    
    # Recall = intersection / k
    recall = len(ann_ids & exact_ids) / k
    return recall

# Target: >95% recall

# 3. Index build time
# Track re-indexing time when adding new vectors

# 4. Memory usage
# Monitor RAM consumption (especially for HNSW)

# 5. Error rate
# Failed queries, timeouts

Sharding strategies:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  1. Hash-based sharding                       β”‚
β”‚  Vector ID β†’ hash β†’ shard assignment          β”‚
β”‚  βœ… Even distribution                         β”‚
β”‚  ❌ Each query must hit all shards            β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  2. Cluster-based sharding                    β”‚
β”‚  Group similar vectors in same shard          β”‚
β”‚  βœ… Query only hits relevant shards           β”‚
β”‚  ❌ Uneven distribution (hot shards)          β”‚
β”‚  ❌ Requires initial clustering               β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  3. Replication                               β”‚
β”‚  Each shard replicated 3x                     β”‚
β”‚  βœ… High availability                         β”‚
β”‚  βœ… Read scaling                              β”‚
β”‚  ❌ 3x storage cost                           β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
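The hash-based scatter-gather pattern can be sketched in memory (the Shard class is a stand-in for a real node):

```python
import numpy as np

N_SHARDS = 4

class Shard:
    """In-memory stand-in for one shard node (brute-force search)."""
    def __init__(self):
        self.ids, self.vecs = [], []

    def add(self, vec_id, vec):
        self.ids.append(vec_id)
        self.vecs.append(vec)

    def search(self, q, k):
        sims = [float(np.dot(q, v)) for v in self.vecs]
        order = np.argsort(sims)[::-1][:k]
        return [{"id": self.ids[i], "score": sims[i]} for i in order]

shards = [Shard() for _ in range(N_SHARDS)]

def insert(vec_id, vec):
    # Route by hash of the id -> roughly even distribution
    shards[hash(vec_id) % N_SHARDS].add(vec_id, vec)

def search(q, top_k):
    # Scatter: the query hits every shard; gather: merge per-shard top-k
    hits = [h for s in shards for h in s.search(q, top_k)]
    return sorted(hits, key=lambda h: -h["score"])[:top_k]

# Demo: 100 vectors spread over 4 shards
rng = np.random.default_rng(0)
vecs = {f"v{i}": rng.normal(size=8) for i in range(100)}
for vid, v in vecs.items():
    insert(vid, v)

q = rng.normal(size=8)
top = search(q, top_k=5)
```

Because every shard returns its own top-k, merging them recovers the exact global top-k β€” at the cost of querying all shards.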

Milvus distributed example:

# Milvus cluster with 3 query nodes
apiVersion: milvus.io/v1beta1
kind: Milvus
metadata:
  name: my-milvus
spec:
  mode: cluster
  components:
    queryNode:
      replicas: 3  # 3 query nodes for parallel search
    dataNode:
      replicas: 2  # 2 data nodes for ingestion

Cost Optimization

1. Dimension reduction:

from sklearn.decomposition import PCA

# Original: 768D (OpenAI embedding)
# Reduced: 256D
pca = PCA(n_components=256)
reduced_embeddings = pca.fit_transform(original_embeddings)

# Trade-off: ~10% accuracy loss, 3x storage savings

2. Quantization:

# Convert float32 β†’ int8
# 768D * 4 bytes = 3KB
# 768D * 1 byte = 768 bytes
# β†’ 4x savings

# Pinecone: Automatic
# Milvus:
collection.create_index(
    field_name="embedding",
    index_params={
        "metric_type": "L2",
        "index_type": "IVF_SQ8",  # Scalar quantization to 8-bit
        "params": {"nlist": 1024}
    }
)
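For intuition, here is what 8-bit scalar quantization does, as a hand-rolled NumPy sketch (not the Milvus implementation):

```python
import numpy as np

def quantize(vecs):
    # Per-dimension min/max -> map floats onto the 0..255 integer range
    lo, hi = vecs.min(axis=0), vecs.max(axis=0)
    scale = (hi - lo) / 255.0
    codes = np.round((vecs - lo) / scale).astype(np.uint8)
    return codes, lo, scale

def dequantize(codes, lo, scale):
    return codes.astype(np.float32) * scale + lo

np.random.seed(0)
vecs = np.random.rand(1000, 768).astype(np.float32)
codes, lo, scale = quantize(vecs)

print(vecs.nbytes // codes.nbytes)  # 4  (float32 -> uint8)
err = np.abs(dequantize(codes, lo, scale) - vecs).max()
print(float(err))                   # small reconstruction error (~half a step)
```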

3. Lazy loading (tiered storage):

Hot tier (SSD): Recent/popular vectors
Cold tier (HDD/S3): Old/rarely accessed vectors

Move vectors between tiers based on access patterns

Level 4: Bleeding Edge (Senior+)

Learned Indexes

Idea: Use an ML model to predict a vector's position in the index.

# Traditional: Tree/graph traversal
# Learned: Model.predict(vector) β†’ position

# Example: RMI (Recursive Model Index)
# Stage 1: Coarse model (neural net)
#   Input: vector β†’ Output: approximate position range
# Stage 2: Fine model
#   Input: vector + range β†’ Output: exact position

Status: Research phase, not production-ready yet (2026).

Multi-modal Embeddings

CLIP: Unified embedding space for text + images.

import clip
import torch
from PIL import Image

model, preprocess = clip.load("ViT-B/32")

# Embed image
image = preprocess(Image.open("dog.jpg")).unsqueeze(0)
image_embedding = model.encode_image(image)

# Embed text
text = clip.tokenize(["a photo of a dog"])
text_embedding = model.encode_text(text)

# Compare
similarity = torch.cosine_similarity(image_embedding, text_embedding)

# Use case: Search images with text!
query = "sunset over mountains"
# β†’ Find images matching that description

Streaming Updates

Challenge: Millions of vectors added per day, can't rebuild index.

Solution: Incremental indexing.

# Pinecone: Real-time updates (no rebuild needed)
index.upsert(new_vectors)  # Available immediately

# Milvus: Segment-based
# New vectors β†’ New segment
# Background: Merge segments periodically

Vector Databases at Scale

Netflix: 100M+ vectors (movie/user embeddings)

  • Milvus cluster
  • PQ compression
  • Hybrid search (vector + metadata filters)

Pinterest: 3B+ vectors (image embeddings)

  • Custom C++ implementation
  • GPU-accelerated search
  • Distributed across 100+ nodes

Uber: 1B+ vectors (driver/rider embeddings)

  • Hybrid PostgreSQL + specialized vector engine
  • Real-time updates (<100ms latency)

GPU Acceleration

# Faiss on GPU (Facebook AI Similarity Search)
import faiss

# CPU
index_cpu = faiss.IndexFlatL2(dimension)
index_cpu.add(vectors)
D, I = index_cpu.search(query_vectors, k=10)

# GPU (10-100x faster)
res = faiss.StandardGpuResources()
index_gpu = faiss.index_cpu_to_gpu(res, 0, index_cpu)
D, I = index_gpu.search(query_vectors, k=10)

Use case: Batch similarity computation (recommendations, deduplication).


Production Checklist

For New Projects (0-10K vectors)

  • Chroma or pgvector (simple, embedded)
  • Basic cosine similarity search
  • Metadata filtering
  • Monitor query latency

For Growing Projects (10K-1M vectors)

  • Pinecone (managed) or Weaviate (self-hosted)
  • HNSW or IVF indexing
  • Hybrid search (keyword + vector)
  • Batch upserts
  • Set up monitoring (latency, recall)

For Scale (1M+ vectors)

  • Milvus or Qdrant (horizontal scaling)
  • Quantization (PQ or SQ8)
  • Distributed deployment (3+ nodes)
  • Replication for HA
  • GPU acceleration (if batch workloads)
  • Cost optimization (dimension reduction, tiered storage)
  • A/B test index configurations

For Enterprise (10M+ vectors)

  • Custom tuning (index params, ef_construction, nprobe)
  • Multi-region deployment
  • Disaster recovery plan
  • Security (encryption at rest/transit, access control)
  • Compliance (data residency, audit logs)
  • Dedicated SRE team

Common Mistakes & How to Avoid

1. ❌ Not normalizing vectors

# Bad: Unnormalized vectors
embedding = model.encode(text)  # [0.5, 10.3, -2.4, ...]

# Good: Normalize before storing
from sklearn.preprocessing import normalize
embedding_norm = normalize([embedding])[0]  # L2 norm = 1

Why: Many indexes compute the dot product for speed, and dot product only matches cosine similarity when vectors are unit-length.

2. ❌ Wrong distance metric

# Text embeddings: Use COSINE (the metric is set at index creation)
pinecone.create_index(..., metric="cosine")

# Image embeddings: Often EUCLIDEAN
pinecone.create_index(..., metric="euclidean")

3. ❌ Forgetting metadata

# Bad: Only store vectors
index.upsert([{"id": "1", "values": embedding}])
# β†’ Can't filter, can't show original text

# Good: Store metadata
index.upsert([{
    "id": "1",
    "values": embedding,
    "metadata": {
        "text": original_text,
        "source": "wikipedia",
        "date": "2024-01-15",
        "tags": ["AI", "ML"]
    }
}])

4. ❌ Not tuning index parameters

# Default HNSW params might not be optimal
# Tune ef_construction, M for your use case
# (illustrative calls β€” the exact API depends on your vector DB)

# Low latency, OK with lower recall:
index.create_index(index_type="HNSW", params={"M": 16, "efConstruction": 100})

# High recall, OK with higher latency:
index.create_index(index_type="HNSW", params={"M": 64, "efConstruction": 500})

5. ❌ Synchronous embedding generation

# Bad: Block API response
@app.post("/search")
def search(query: str):
    embedding = model.encode(query)  # 100-500ms!
    results = index.query(embedding)
    return results

# Good: Cache embeddings or use async
from functools import lru_cache

@lru_cache(maxsize=1000)
def get_embedding(text: str):
    return model.encode(text)

Resources

Research Papers:

  • "Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs" (HNSW)
  • "Product Quantization for Nearest Neighbor Search" (PQ)
  • "Learning to Index" (Learned indexes)


Interview Questions

Junior:

  • What is a vector embedding? Why do we need it?
  • Cosine similarity vs Euclidean distance β€” when to use which?
  • What are the use cases for a vector database?

Mid:

  • So sΓ‘nh HNSW vs IVF indexing
  • Trade-offs giα»―a exact search vs ANN
  • ThiαΊΏt kαΊΏ RAG system vα»›i vector DB

Senior:

  • Distributed vector search architecture
  • Cost optimization strategies for 1B+ vectors
  • Hybrid search implementation (keyword + vector)

Senior+:

  • Multi-modal embedding challenges
  • Learned indexes for vector search
  • Real-time updates at scale

Summary

Level    Focus                                          Tools
────────────────────────────────────────────────────────────────────────────────
New      Vector concepts, embeddings, similarity        NumPy, sentence-transformers
Junior   Vector DB basics, indexing                     Chroma, pgvector, Pinecone
Mid      Production deployment, hybrid search           Weaviate, Qdrant, Milvus
Senior   Scale, distributed systems, cost optimization  Custom configs, monitoring, GPU
Senior+  Bleeding edge (multi-modal, learned indexes)   Research papers, custom solutions

Key takeaway: Vector databases power semantic search β€” understanding embeddings β†’ similarity β†’ indexing is the foundation. From there, scale up based on your use case.


Good luck building great AI-powered applications! πŸš€

Next steps:

  • storage-and-indexing.md β€” Traditional indexing (B-tree vs vector indexes)
  • query-and-transactions.md β€” Query optimization
  • Hands-on: build semantic search over your own docs!