πŸ—„οΈ Database✍️ KhoaπŸ“… 19/04/2026β˜• 17 phΓΊt đọc

Vector Databases: Tα»« Zero Δ‘αΊΏn Senior+

Vector databases are the "silent heroes" behind every AI application you use daily: ChatGPT search, Spotify recommendations, Google Photos face recognition. If you are building AI-powered apps, this is must-have knowledge.

🎯 Analogy: A traditional DB is like looking up a dictionary (exact match). A vector DB is like asking "give me words with a similar meaning" β†’ semantic search.


Level 1: Foundations (New β†’ Junior)

What is a Vector?

Vector = Array of numbers representing meaning/features.

# Text example
"dog" β†’ [0.2, 0.8, 0.1, 0.5, ...]  # 768 dimensions
"puppy" β†’ [0.25, 0.75, 0.15, 0.48, ...]  # Similar to "dog"
"car" β†’ [0.9, 0.1, 0.8, 0.2, ...]  # Different from "dog"

# Image example
πŸ• (image) β†’ [0.34, 0.12, 0.89, ...]
πŸš— (image) β†’ [0.91, 0.05, 0.23, ...]

Key insight: Vectors capture semantic meaning β€” similar concepts have similar vectors.

Embeddings: From Data to Vectors

Embedding = the process of converting data (text, images, audio) into a vector.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

# Embed text
text = "I love machine learning"
embedding = model.encode(text)
print(embedding.shape)  # (384,) - 384-dimensional vector
print(embedding[:5])    # [0.0234, -0.1234, 0.5678, ...]

Popular embedding models:

  • Text: OpenAI text-embedding-3-small, Cohere Embed, sentence-transformers
  • Images: CLIP, ResNet, ViT
  • Code: CodeBERT, GraphCodeBERT
  • Multimodal: CLIP (text + images)

Similarity Search: Core Operation

Problem: "Find the 10 items most similar to this query"

import numpy as np

# 3 documents in our "database"
docs = {
    "doc1": np.array([0.2, 0.8, 0.1]),  # "I love dogs"
    "doc2": np.array([0.25, 0.75, 0.15]),  # "Puppies are cute"
    "doc3": np.array([0.9, 0.1, 0.8]),  # "Cars are fast"
}

query = np.array([0.22, 0.78, 0.12])  # "Dogs are amazing"

# Calculate similarity
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

similarities = {
    doc_id: cosine_similarity(query, vec)
    for doc_id, vec in docs.items()
}

print(similarities)
# {'doc1': 0.999, 'doc2': 0.998, 'doc3': 0.376}
# β†’ doc1 and doc2 are most similar to the query!

Distance Metrics

Three common metrics:

1. Cosine Similarity (most common):

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
# Range: -1 (opposite) to 1 (identical)
# 0 = orthogonal (unrelated)

2. Euclidean Distance (L2):

def euclidean_distance(a, b):
    return np.linalg.norm(a - b)
# Smaller = more similar

3. Dot Product:

def dot_product(a, b):
    return np.dot(a, b)
# Higher = more similar (for normalized vectors)

When to use which?

  • Cosine: Text embeddings (direction matters, not magnitude)
  • Euclidean: Image embeddings (absolute position matters)
  • Dot Product: When vectors are already normalized
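A quick sanity check of these rules: for two vectors pointing the same way but with different magnitudes, cosine calls them identical while Euclidean does not, and after L2 normalization the dot product agrees with cosine:

```python
import numpy as np

a = np.array([1.0, 2.0, 2.0])
b = 2 * a  # same direction, twice the magnitude

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine(a, b))           # 1.0  (direction identical, magnitude ignored)
print(np.linalg.norm(a - b))  # 3.0  (Euclidean sees the magnitude gap)

# After L2 normalization, dot product equals cosine similarity
a_n = a / np.linalg.norm(a)
b_n = b / np.linalg.norm(b)
print(np.dot(a_n, b_n))       # ~1.0
```

This is why text embeddings (where only direction carries meaning) default to cosine, while unnormalized feature vectors often use Euclidean.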

Real-world Use Cases

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  1. Semantic Search                              β”‚
β”‚  User: "how to train a neural network"           β”‚
β”‚  β†’ Find similar docs (not just keyword match)    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  2. Recommendation Systems                       β”‚
β”‚  User liked movie A                              β”‚
β”‚  β†’ Find movies with similar embeddings           β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  3. RAG (Retrieval Augmented Generation)         β”‚
β”‚  ChatGPT question β†’ Find relevant docs           β”‚
β”‚  β†’ Feed to LLM as context                        β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  4. Image Search                                 β”‚
β”‚  Upload image β†’ Find similar images              β”‚
β”‚  (Google Photos, Pinterest)                      β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  5. Anomaly Detection                            β”‚
β”‚  Normal transactions: cluster tightly            β”‚
β”‚  Fraud: far from cluster β†’ detect!               β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
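The anomaly-detection idea above can be sketched as a distance-to-centroid rule (toy data and threshold are made up for the demo):

```python
import numpy as np

# Embeddings of known-normal transactions (toy 2D data for illustration)
normal = np.array([[0.10, 0.20], [0.12, 0.19], [0.09, 0.21]])
centroid = normal.mean(axis=0)

# Threshold: e.g. 3x the mean distance of normal points to the centroid
threshold = 3 * np.linalg.norm(normal - centroid, axis=1).mean()

def is_anomaly(vec):
    # Far from the cluster of normal behavior -> flag it
    return np.linalg.norm(vec - centroid) > threshold

print(is_anomaly(np.array([0.11, 0.20])))  # inside the cluster -> False
print(is_anomaly(np.array([0.90, 0.95])))  # far away -> True
```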

First Vector DB: In-memory with NumPy

class SimpleVectorDB:
    def __init__(self):
        self.vectors = []
        self.metadata = []
    
    def insert(self, vector, metadata):
        self.vectors.append(vector)
        self.metadata.append(metadata)
    
    def search(self, query_vector, top_k=5):
        # Brute force: compare against every vector
        similarities = []
        for i, vec in enumerate(self.vectors):
            sim = cosine_similarity(query_vector, vec)
            similarities.append((sim, i))
        
        # Sort by similarity (descending)
        similarities.sort(reverse=True)
        
        # Return top K
        results = []
        for sim, idx in similarities[:top_k]:
            results.append({
                'similarity': sim,
                'metadata': self.metadata[idx]
            })
        return results

# Usage
db = SimpleVectorDB()

# Insert documents
model = SentenceTransformer('all-MiniLM-L6-v2')

docs = [
    "The cat sits on the mat",
    "Dogs are loyal animals",
    "Machine learning is fascinating",
]

for doc in docs:
    embedding = model.encode(doc)
    db.insert(embedding, {'text': doc})

# Search
query = "Tell me about artificial intelligence"
query_embedding = model.encode(query)
results = db.search(query_embedding, top_k=2)

for r in results:
    print(f"Similarity: {r['similarity']:.3f}")
    print(f"Text: {r['metadata']['text']}\n")

The problem with this approach?
β†’ O(N) complexity: every query compares against ALL vectors β€” slow once you have millions of them!


Level 2: Vector Databases & Indexing (Mid-level)

Why do we need a Vector Database?

In-memory NumPy:
βœ… Simple
βœ… Good for <10K vectors
❌ Slow (O(N) search)
❌ No persistence
❌ No concurrent access
❌ Doesn't scale

Vector Database:
βœ… Fast search (O(log N) or better)
βœ… Persistent storage
βœ… Horizontal scaling
βœ… Production-ready (transactions, backups, monitoring)
βœ… Approximate search (trade accuracy for speed)

Database       Type           Best For                      Language
─────────────────────────────────────────────────────────────────────
Pinecone       Cloud-native   Managed service, easy setup   Any (REST API)
Weaviate       Full-featured  Hybrid search, multi-tenancy  Go
Qdrant         Modern         Performance, Rust-powered     Rust
Milvus         Enterprise     Large scale, Kubernetes       C++/Python/Go
Chroma         Embedded       Development, prototyping      Python
pgvector       Extension      Existing Postgres users       SQL
Elasticsearch  Search engine  Already using ES              Java

ANN: Approximate Nearest Neighbor

Trade-off: 100% accuracy vs speed

Exact search (brute force):
- Compare against all N vectors
- 100% accuracy
- O(N) complexity
- Slow: 1M vectors = 1M comparisons

Approximate search (ANN):
- Use index structure (tree, graph)
- ~95-99% accuracy (configurable)
- O(log N) or O(1) complexity
- Fast: 1M vectors = ~20 comparisons

Key insight: In most use cases, 99% accuracy is enough (users can't tell the difference).

Indexing Algorithms

1. HNSW (Hierarchical Navigable Small World)

Graph-based: vectors are nodes, edges connect similar vectors.

Level 2:  πŸ”΄ ←→ πŸ”΄ ←→ πŸ”΄  (sparse, long jumps)
          ↓      ↓      ↓
Level 1:  πŸ”΅β†’πŸ”΅β†’πŸ”΅β†’πŸ”΅β†’πŸ”΅  (medium density)
          ↓  ↓  ↓  ↓  ↓
Level 0:  🟒🟒🟒🟒🟒🟒🟒🟒  (dense, all vectors)

Search:
1. Start at top level (long jumps)
2. Navigate to closest node
3. Drop to next level
4. Repeat until bottom
5. Refine search at bottom level

Characteristics:

  • βœ… Very fast search (O(log N))
  • βœ… High recall (accuracy)
  • ❌ Slow build time
  • ❌ Memory-intensive (graph in RAM)

Good for: Real-time search, when RAM is available

2. IVF (Inverted File Index)

Clustering-based: Partition vectors into clusters.

1. Build phase:
   - K-means clustering β†’ 1000 clusters
   - Each vector assigned to nearest cluster

2. Search phase:
   - Find query's nearest cluster(s)
   - Search only within those clusters
   - Only check 1/1000 of data!

Example:
Cluster 1: [dogs, puppies, pets, ...]
Cluster 2: [AI, ML, neural nets, ...]
Cluster 3: [cars, vehicles, ...]

Query: "machine learning" β†’ Search Cluster 2 only

Characteristics:

  • βœ… Fast search (O(log N))
  • βœ… Less memory than HNSW
  • ❌ Lower recall (might miss edge cases)
  • βœ… Fast build time

Good for: Large datasets, limited RAM

3. Product Quantization (PQ)

Compression: Reduce vector size β†’ fit more in RAM.

Original vector (768D, 32-bit float):
[0.234, -0.123, 0.567, ..., 0.891]
Size: 768 * 4 bytes = 3KB

After PQ compression:
[23, 145, 89, ...]  (codebook indices)
Size: 768 / 8 * 1 byte = 96 bytes
β†’ 32x compression!

Trade-off: Smaller size vs accuracy

Good for: Billions of vectors, RAM constraints

4. Hybrid Approaches

Most production systems combine algorithms:

Pinecone: IVF + PQ
Weaviate: HNSW + PQ (optional)
Milvus: IVF_FLAT, IVF_SQ8, HNSW, etc. (configurable)

Implementation: Pinecone (Managed Service)

import pinecone
from sentence_transformers import SentenceTransformer

# Initialize (classic pinecone-client v2 API; newer clients use Pinecone(api_key=...))
pinecone.init(api_key="your-api-key", environment="us-east1-gcp")

# Create index
pinecone.create_index(
    name="semantic-search",
    dimension=384,  # Model's output dimension
    metric="cosine",
    pod_type="p1.x1"  # Performance tier
)

index = pinecone.Index("semantic-search")

# Embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Insert vectors
docs = [
    {"id": "doc1", "text": "Machine learning basics"},
    {"id": "doc2", "text": "Neural networks explained"},
    {"id": "doc3", "text": "Cooking recipes for beginners"},
]

vectors_to_upsert = []
for doc in docs:
    embedding = model.encode(doc['text']).tolist()
    vectors_to_upsert.append({
        "id": doc['id'],
        "values": embedding,
        "metadata": {"text": doc['text']}
    })

index.upsert(vectors=vectors_to_upsert)

# Search
query = "Tell me about AI"
query_embedding = model.encode(query).tolist()

results = index.query(
    vector=query_embedding,
    top_k=3,
    include_metadata=True
)

for match in results['matches']:
    print(f"Score: {match['score']:.3f}")
    print(f"Text: {match['metadata']['text']}\n")

Implementation: Weaviate (Open-source)

import weaviate
from weaviate.classes.config import Configure, Property, DataType

# Connect
client = weaviate.connect_to_local()

# Create collection with vectorizer (v4 client uses Property objects)
client.collections.create(
    name="Document",
    vectorizer_config=Configure.Vectorizer.text2vec_openai(),
    properties=[
        Property(name="title", data_type=DataType.TEXT),
        Property(name="content", data_type=DataType.TEXT),
    ]
)

collection = client.collections.get("Document")

# Insert (auto-vectorization)
collection.data.insert_many([
    {"title": "AI Intro", "content": "Machine learning basics"},
    {"title": "Cooking 101", "content": "How to make pasta"},
])

# Search (auto-vectorize query)
results = collection.query.near_text(
    query="Tell me about artificial intelligence",
    limit=2
)

for item in results.objects:
    print(f"{item.properties['title']}: {item.properties['content']}")

Implementation: pgvector (PostgreSQL Extension)

-- Install extension
CREATE EXTENSION vector;

-- Create table with vector column
CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    content TEXT,
    embedding VECTOR(384)  -- 384 dimensions
);

-- Create index (IVF)
CREATE INDEX ON documents 
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);  -- 100 clusters

-- Insert
INSERT INTO documents (content, embedding) 
VALUES ('Machine learning basics', '[0.1, 0.2, 0.3, ...]');

-- Search
SELECT content, 1 - (embedding <=> '[0.12, 0.21, ...]') AS similarity
FROM documents
ORDER BY embedding <=> '[0.12, 0.21, ...]'
LIMIT 5;

Pros cα»§a pgvector:

  • βœ… Existing Postgres infrastructure
  • βœ… ACID transactions
  • βœ… Join with relational data
  • βœ… Familiar SQL

Cons:

  • ❌ Slower than specialized vector DBs
  • ❌ Limited indexing options
  • ❌ Harder to scale horizontally

Level 3: Production & Advanced Patterns (Senior)

Hybrid Search: Keyword + Vector

Problem: Pure vector search struggles with exact matches (product IDs, names).

Solution: Combine keyword search (BM25) + vector search.

# Weaviate hybrid search
results = collection.query.hybrid(
    query="iPhone 15",
    alpha=0.5,  # 0=keyword only, 1=vector only, 0.5=balanced
    limit=10
)

How it works:

1. Keyword search (BM25):
   "iPhone 15" β†’ High score for exact matches

2. Vector search:
   "iPhone 15" β†’ High score for semantic matches
   (e.g., "Apple's latest smartphone")

3. Combine scores:
   final_score = alpha * vector_score + (1 - alpha) * keyword_score
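The score combination in step 3 can be sketched in plain Python. Min-max normalization is one common way to make BM25 and cosine scores comparable; real engines may use rank fusion instead:

```python
def min_max(scores):
    # Rescale a {doc_id: score} dict into [0, 1]
    lo, hi = min(scores.values()), max(scores.values())
    return {k: (v - lo) / (hi - lo) if hi > lo else 0.0 for k, v in scores.items()}

def hybrid_scores(keyword_scores, vector_scores, alpha=0.5):
    kw, vec = min_max(keyword_scores), min_max(vector_scores)
    docs = set(kw) | set(vec)
    # A doc missing from one ranking contributes 0 from that side
    return {d: alpha * vec.get(d, 0.0) + (1 - alpha) * kw.get(d, 0.0) for d in docs}

kw = {"doc1": 12.3, "doc2": 3.1}    # BM25: doc1 has the exact "iPhone 15" match
vec = {"doc2": 0.91, "doc3": 0.88}  # vector: doc2 is semantically closest
print(sorted(hybrid_scores(kw, vec).items(), key=lambda x: -x[1]))
```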

Filtering with Metadata

Challenge: "Find similar docs, but only from 2024, in English, tagged 'tech'"

# Pinecone
results = index.query(
    vector=query_embedding,
    top_k=10,
    filter={
        "year": {"$eq": 2024},
        "language": {"$eq": "en"},
        "tags": {"$in": ["tech"]}
    }
)

# Weaviate (v4 client)
from weaviate.classes.query import Filter

results = collection.query.near_vector(
    near_vector=query_embedding,
    limit=10,
    filters=Filter.by_property("year").equal(2024)
)

Architecture pattern:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Pre-filtering (before vector search)β”‚
β”‚  Filter by metadata first            β”‚
β”‚  β†’ Smaller candidate set             β”‚
β”‚  β†’ Faster vector search              β”‚
β”‚  ❌ Limited by index structure       β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Post-filtering (after vector search)β”‚
β”‚  Vector search first                 β”‚  
β”‚  β†’ Filter results                    β”‚
β”‚  βœ… More flexible                    β”‚
β”‚  ❌ Might miss results if top_k low  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Best practice: Use pre-filtering when possible (faster).
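A post-filtering sketch with the usual mitigation, over-fetching before filtering (the index_query callable and the overfetch factor are assumptions for the demo):

```python
def search_with_post_filter(index_query, query_vec, predicate, top_k=10, overfetch=5):
    # Ask for more candidates than needed, since some will be filtered out
    candidates = index_query(query_vec, top_k * overfetch)
    kept = [c for c in candidates if predicate(c["metadata"])]
    return kept[:top_k]

# Demo with a fake in-memory index_query: 50 docs, alternating years
fake_results = [{"id": i, "metadata": {"year": 2023 + i % 2}} for i in range(50)]
results = search_with_post_filter(
    lambda q, k: fake_results[:k],
    query_vec=None,
    predicate=lambda m: m["year"] == 2024,
    top_k=10,
)
print(len(results))  # 10
```

If the filter is very selective, even a large overfetch can come back short β€” which is exactly why pre-filtering is preferred when the index supports it.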

Multi-vector / Late Interaction

Problem: Single vector per document loses nuance.

Solution: Multiple vectors per document.

# ColBERT approach: Token-level embeddings
document = "Machine learning is a subset of AI"
tokens = ["Machine", "learning", "is", "a", "subset", "of", "AI"]

# Each token gets embedding
token_embeddings = [
    model.encode(token) for token in tokens
]  # 7 vectors for 1 document

# Search: Compare query tokens with document tokens
query = "What is ML?"
query_tokens = ["What", "is", "ML"]
query_embeddings = [model.encode(t) for t in query_tokens]

# Max similarity for each query token
score = sum([
    max([cosine_similarity(q_emb, d_emb) for d_emb in token_embeddings])
    for q_emb in query_embeddings
])

Use case: Long documents, QA systems.

Chunking Strategies

Problem: Embedding models have a maximum input length (512 tokens for many models).

Strategy 1: Fixed-size chunks

def chunk_text(text, chunk_size=512, overlap=50):
    # Note: splits on words as a rough proxy for tokens
    words = text.split()
    chunks = []
    for i in range(0, len(words), chunk_size - overlap):
        chunk = ' '.join(words[i:i + chunk_size])
        chunks.append(chunk)
    return chunks

# Pro: Simple
# Con: Might split sentences/paragraphs awkwardly

Strategy 2: Semantic chunking

def semantic_chunk(text):
    paragraphs = text.split('\n\n')
    chunks = []
    current_chunk = []
    current_length = 0
    
    for para in paragraphs:
        para_len = len(para.split())
        if current_length + para_len > 512:
            chunks.append(' '.join(current_chunk))
            current_chunk = [para]
            current_length = para_len
        else:
            current_chunk.append(para)
            current_length += para_len
    
    if current_chunk:
        chunks.append(' '.join(current_chunk))
    
    return chunks

# Pro: Preserves semantic boundaries
# Con: Variable chunk sizes

Strategy 3: Sliding window with parent-child

# Store both:
# - Small chunks (for precise retrieval)
# - Large parent context (for LLM)

chunks = [
    {"chunk": "ML is a subset of AI", "parent_id": "doc1"},
    {"chunk": "Neural networks are...", "parent_id": "doc1"},
]

# Search on chunks, return parent doc for context
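A minimal sketch of the chunk-hit to parent-doc lookup (plain dicts stand in for the real stores):

```python
# Parent store: full documents, keyed by id
parents = {
    "doc1": "Full article covering ML, neural networks, and more ...",
}
# Chunk store: what the vector index actually searches over
chunks = [
    {"chunk": "ML is a subset of AI", "parent_id": "doc1"},
    {"chunk": "Neural networks are...", "parent_id": "doc1"},
]

def retrieve_context(chunk_hits, parents):
    # Deduplicate parents so the LLM doesn't see the same document twice
    seen, context = set(), []
    for hit in chunk_hits:
        pid = hit["parent_id"]
        if pid not in seen:
            seen.add(pid)
            context.append(parents[pid])
    return context

print(retrieve_context(chunks, parents))  # one parent doc, not two raw chunks
```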

Batch Operations & Indexing

# Bad: One at a time
for doc in documents:
    embedding = model.encode(doc)
    index.upsert([{"id": doc['id'], "values": embedding}])
# Slow: Many network calls

# Good: Batch upsert
batch_size = 100
for i in range(0, len(documents), batch_size):
    batch = documents[i:i+batch_size]
    embeddings = model.encode([d['text'] for d in batch])
    
    vectors = [
        {"id": d['id'], "values": emb.tolist()}
        for d, emb in zip(batch, embeddings)
    ]
    
    index.upsert(vectors=vectors)
# 100x faster

Monitoring & Observability

Key metrics:

# 1. Query latency
import time

start = time.time()
results = index.query(query_vector, top_k=10)
latency = time.time() - start

# Target: <50ms for p99

# 2. Recall (accuracy)
# Compare ANN results vs exact search
def measure_recall(query_vector, k=10):
    # ANN search
    ann_results = index.query(query_vector, top_k=k)
    ann_ids = {r['id'] for r in ann_results['matches']}
    
    # Exact search (brute force)
    exact_results = exact_search(query_vector, k)
    exact_ids = {r['id'] for r in exact_results}
    
    # Recall = intersection / k
    recall = len(ann_ids & exact_ids) / k
    return recall

# Target: >95% recall

# 3. Index build time
# Track re-indexing time when adding new vectors

# 4. Memory usage
# Monitor RAM consumption (especially for HNSW)

# 5. Error rate
# Failed queries, timeouts

Sharding strategies:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  1. Hash-based sharding                       β”‚
β”‚  Vector ID β†’ hash β†’ shard assignment          β”‚
β”‚  βœ… Even distribution                         β”‚
β”‚  ❌ Each query must hit all shards            β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  2. Cluster-based sharding                    β”‚
β”‚  Group similar vectors in same shard          β”‚
β”‚  βœ… Query only hits relevant shards           β”‚
β”‚  ❌ Uneven distribution (hot shards)          β”‚
β”‚  ❌ Requires initial clustering               β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  3. Replication                               β”‚
β”‚  Each shard replicated 3x                     β”‚
β”‚  βœ… High availability                         β”‚
β”‚  βœ… Read scaling                              β”‚
β”‚  ❌ 3x storage cost                           β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
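The hash-based scatter-gather pattern can be sketched in memory (the Shard class is a stand-in for a real node):

```python
import numpy as np

N_SHARDS = 4

class Shard:
    """In-memory stand-in for one shard node (brute-force search)."""
    def __init__(self):
        self.ids, self.vecs = [], []

    def add(self, vec_id, vec):
        self.ids.append(vec_id)
        self.vecs.append(vec)

    def search(self, q, k):
        sims = [float(np.dot(q, v)) for v in self.vecs]
        order = np.argsort(sims)[::-1][:k]
        return [{"id": self.ids[i], "score": sims[i]} for i in order]

shards = [Shard() for _ in range(N_SHARDS)]

def insert(vec_id, vec):
    # Route by hash of the id -> roughly even distribution
    shards[hash(vec_id) % N_SHARDS].add(vec_id, vec)

def search(q, top_k):
    # Scatter: the query hits every shard; gather: merge per-shard top-k
    hits = [h for s in shards for h in s.search(q, top_k)]
    return sorted(hits, key=lambda h: -h["score"])[:top_k]

# Demo: 100 vectors spread over 4 shards
rng = np.random.default_rng(0)
vecs = {f"v{i}": rng.normal(size=8) for i in range(100)}
for vid, v in vecs.items():
    insert(vid, v)

q = rng.normal(size=8)
top = search(q, top_k=5)
```

Because every shard returns its own top-k, merging them recovers the exact global top-k β€” at the cost of querying all shards.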

Milvus distributed example:

# Milvus cluster with 3 query nodes
apiVersion: milvus.io/v1beta1
kind: Milvus
metadata:
  name: my-milvus
spec:
  mode: cluster
  components:
    queryNode:
      replicas: 3  # 3 query nodes for parallel search
    dataNode:
      replicas: 2  # 2 data nodes for ingestion

Cost Optimization

1. Dimension reduction:

from sklearn.decomposition import PCA

# Original: 768D (OpenAI embedding)
# Reduced: 256D
pca = PCA(n_components=256)
reduced_embeddings = pca.fit_transform(original_embeddings)

# Trade-off: ~10% accuracy loss, 3x storage savings

2. Quantization:

# Convert float32 β†’ int8
# 768D * 4 bytes = 3KB
# 768D * 1 byte = 768 bytes
# β†’ 4x savings

# Pinecone: Automatic
# Milvus:
collection.create_index(
    field_name="embedding",
    index_params={
        "metric_type": "L2",
        "index_type": "IVF_SQ8",  # Scalar quantization to 8-bit
        "params": {"nlist": 1024}
    }
)
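For intuition, here is what 8-bit scalar quantization does, as a hand-rolled NumPy sketch (not the Milvus implementation):

```python
import numpy as np

def quantize(vecs):
    # Per-dimension min/max -> map floats onto the 0..255 integer range
    lo, hi = vecs.min(axis=0), vecs.max(axis=0)
    scale = (hi - lo) / 255.0
    codes = np.round((vecs - lo) / scale).astype(np.uint8)
    return codes, lo, scale

def dequantize(codes, lo, scale):
    return codes.astype(np.float32) * scale + lo

np.random.seed(0)
vecs = np.random.rand(1000, 768).astype(np.float32)
codes, lo, scale = quantize(vecs)

print(vecs.nbytes // codes.nbytes)  # 4  (float32 -> uint8)
err = np.abs(dequantize(codes, lo, scale) - vecs).max()
print(float(err))                   # small reconstruction error (~half a step)
```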

3. Lazy loading (tiered storage):

Hot tier (SSD): Recent/popular vectors
Cold tier (HDD/S3): Old/rarely accessed vectors

Move vectors between tiers based on access patterns

Level 4: Bleeding Edge (Senior+)

Learned Indexes

Idea: Use an ML model to predict a vector's position in the index.

# Traditional: Tree/graph traversal
# Learned: Model.predict(vector) β†’ position

# Example: RMI (Recursive Model Index)
# Stage 1: Coarse model (neural net)
#   Input: vector β†’ Output: approximate position range
# Stage 2: Fine model
#   Input: vector + range β†’ Output: exact position

Status: Research phase, not production-ready yet (2026).

Multi-modal Embeddings

CLIP: Unified embedding space for text + images.

import clip
import torch
from PIL import Image

model, preprocess = clip.load("ViT-B/32")

# Embed image
image = preprocess(Image.open("dog.jpg")).unsqueeze(0)
image_embedding = model.encode_image(image)

# Embed text
text = clip.tokenize(["a photo of a dog"])
text_embedding = model.encode_text(text)

# Compare
similarity = torch.cosine_similarity(image_embedding, text_embedding)

# Use case: Search images with text!
query = "sunset over mountains"
# β†’ Find images matching that description

Streaming Updates

Challenge: Millions of vectors added per day, can't rebuild index.

Solution: Incremental indexing.

# Pinecone: Real-time updates (no rebuild needed)
index.upsert(new_vectors)  # Available immediately

# Milvus: Segment-based
# New vectors β†’ New segment
# Background: Merge segments periodically

Vector Databases at Scale

Netflix: 100M+ vectors (movie/user embeddings)

  • Milvus cluster
  • PQ compression
  • Hybrid search (vector + metadata filters)

Pinterest: 3B+ vectors (image embeddings)

  • Custom C++ implementation
  • GPU-accelerated search
  • Distributed across 100+ nodes

Uber: 1B+ vectors (driver/rider embeddings)

  • Hybrid PostgreSQL + specialized vector engine
  • Real-time updates (<100ms latency)

GPU Acceleration

# Faiss on GPU (Facebook AI Similarity Search)
import faiss

# CPU
index_cpu = faiss.IndexFlatL2(dimension)
index_cpu.add(vectors)
D, I = index_cpu.search(query_vectors, k=10)

# GPU (10-100x faster)
res = faiss.StandardGpuResources()
index_gpu = faiss.index_cpu_to_gpu(res, 0, index_cpu)
D, I = index_gpu.search(query_vectors, k=10)

Use case: Batch similarity computation (recommendations, deduplication).


Production Checklist

For New Projects (0-10K vectors)

  • Chroma or pgvector (simple, embedded)
  • Basic cosine similarity search
  • Metadata filtering
  • Monitor query latency

For Growing Projects (10K-1M vectors)

  • Pinecone (managed) or Weaviate (self-hosted)
  • HNSW or IVF indexing
  • Hybrid search (keyword + vector)
  • Batch upserts
  • Set up monitoring (latency, recall)

For Scale (1M+ vectors)

  • Milvus or Qdrant (horizontal scaling)
  • Quantization (PQ or SQ8)
  • Distributed deployment (3+ nodes)
  • Replication for HA
  • GPU acceleration (if batch workloads)
  • Cost optimization (dimension reduction, tiered storage)
  • A/B test index configurations

For Enterprise (10M+ vectors)

  • Custom tuning (index params, ef_construction, nprobe)
  • Multi-region deployment
  • Disaster recovery plan
  • Security (encryption at rest/transit, access control)
  • Compliance (data residency, audit logs)
  • Dedicated SRE team

Common Mistakes & How to Avoid

1. ❌ Not normalizing vectors

# Bad: Unnormalized vectors
embedding = model.encode(text)  # [0.5, 10.3, -2.4, ...]

# Good: Normalize before storing
from sklearn.preprocessing import normalize
embedding_norm = normalize([embedding])[0]  # L2 norm = 1

Why: Many indexes compute the dot product for speed, and dot product only matches cosine similarity when vectors are unit-length.

2. ❌ Wrong distance metric

# Text embeddings: Use COSINE (the metric is set at index creation)
pinecone.create_index(..., metric="cosine")

# Image embeddings: Often EUCLIDEAN
pinecone.create_index(..., metric="euclidean")

3. ❌ Forgetting metadata

# Bad: Only store vectors
index.upsert([{"id": "1", "values": embedding}])
# β†’ Can't filter, can't show original text

# Good: Store metadata
index.upsert([{
    "id": "1",
    "values": embedding,
    "metadata": {
        "text": original_text,
        "source": "wikipedia",
        "date": "2024-01-15",
        "tags": ["AI", "ML"]
    }
}])

4. ❌ Not tuning index parameters

# Default HNSW params might not be optimal
# Tune ef_construction, M for your use case
# (illustrative calls β€” the exact API depends on your vector DB)

# Low latency, OK with lower recall:
index.create_index(index_type="HNSW", params={"M": 16, "efConstruction": 100})

# High recall, OK with higher latency:
index.create_index(index_type="HNSW", params={"M": 64, "efConstruction": 500})

5. ❌ Synchronous embedding generation

# Bad: Block API response
@app.post("/search")
def search(query: str):
    embedding = model.encode(query)  # 100-500ms!
    results = index.query(embedding)
    return results

# Good: Cache embeddings or use async
from functools import lru_cache

@lru_cache(maxsize=1000)
def get_embedding(text: str):
    return model.encode(text)

Resources

Research Papers:

  • "Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs" (HNSW)
  • "Product Quantization for Nearest Neighbor Search" (PQ)
  • "Learning to Index" (Learned indexes)


Interview Questions

Junior:

  • What is a vector embedding? Why do we need it?
  • Cosine similarity vs Euclidean distance β€” when to use which?
  • What are the use cases for a vector database?

Mid:

  • So sΓ‘nh HNSW vs IVF indexing
  • Trade-offs giα»―a exact search vs ANN
  • ThiαΊΏt kαΊΏ RAG system vα»›i vector DB

Senior:

  • Distributed vector search architecture
  • Cost optimization strategies for 1B+ vectors
  • Hybrid search implementation (keyword + vector)

Senior+:

  • Multi-modal embedding challenges
  • Learned indexes for vector search
  • Real-time updates at scale

Summary

Level    Focus                                          Tools
────────────────────────────────────────────────────────────────────────────────
New      Vector concepts, embeddings, similarity        NumPy, sentence-transformers
Junior   Vector DB basics, indexing                     Chroma, pgvector, Pinecone
Mid      Production deployment, hybrid search           Weaviate, Qdrant, Milvus
Senior   Scale, distributed systems, cost optimization  Custom configs, monitoring, GPU
Senior+  Bleeding edge (multi-modal, learned indexes)   Research papers, custom solutions

Key takeaway: Vector databases power semantic search β€” understanding embeddings β†’ similarity β†’ indexing is the foundation. From there, scale up based on your use case.


Good luck building great AI-powered applications! πŸš€

Next steps:

  • storage-and-indexing.md β€” Traditional indexing (B-tree vs vector indexes)
  • query-and-transactions.md β€” Query optimization
  • Hands-on: build semantic search over your own docs!