🗄️ Database✍️ Khoa📅 19/04/2026☕ 9 phút đọc

Database: Storage và Indexing

1. Storage Engines — Tổng quan

Database = Query Layer + Storage Engine. Query layer nhận SQL và quyết định what cần lấy; storage engine quyết định how data được lưu và đọc trên disk.

SQL Query
    │
    ▼ Lexer + Parser → Optimizer
┌─────────────────────────────────┐
│         Query Layer             │
│  Parser → Planner → Executor   │
└──────────────┬──────────────────┘
               │
    ┌──────────▼──────────┐
    │    Storage Engine   │
    │  B-Tree  │  LSM     │
    └──────────┬──────────┘
               │
    ┌──────────▼──────────┐
    │   OS Page Cache     │  ← kernel buffer
    └──────────┬──────────┘
               │
    ┌──────────▼──────────┐
    │    Disk / SSD       │
    └─────────────────────┘

Ví dụ:
  MySQL InnoDB    → B-Tree (clustered)
  PostgreSQL      → B-Tree (heap-based)
  RocksDB         → LSM-Tree
  Cassandra       → LSM-Tree

Tại sao disk I/O quan trọng?

Mọi quyết định thiết kế trong storage engine đều xoay quanh một sự thật đơn giản: RAM nhanh hơn disk hàng nghìn lần.

L1 cache:        ~1 ns
L2 cache:        ~4 ns
RAM:             ~100 ns
NVMe SSD:        ~100 µs   (1,000× chậm hơn RAM)
SATA SSD:        ~500 µs
HDD (seek+read): ~10 ms    (100,000× chậm hơn RAM)

→ Database design = minimize disk I/O. Mọi cấu trúc dữ liệu và thuật toán đều được thiết kế quanh nguyên tắc này.

2. B-Tree — Nền tảng của RDBMS

2.1 Cấu trúc B-Tree

B-Tree (và B+Tree) là cây cân bằng, mỗi node chứa nhiều keys, tối ưu cho block-based storage. Lý do B+Tree phổ biến hơn B-Tree thuần: chỉ leaf nodes mới chứa data, còn internal nodes chỉ chứa keys để navigate — giúp branching factor lớn hơn, cây thấp hơn, ít disk I/O hơn.

B+Tree (dùng trong hầu hết DB):

                    [30 | 70]                    ← Root node
                   /    |    \
          [10|20]    [40|50|60]    [80|90]       ← Internal nodes (chỉ có keys)
          /  |  \    / | | | \    /  |  \
        [5][15][25][35][45][55][65][75][85][95]  ← Leaf nodes (chứa data)
         ↔   ↔   ↔   ↔   ↔   ↔   ↔   ↔   ↔     ← Linked list

Đặc điểm:
  - Leaf nodes linked list → range queries cực hiệu quả
  - Mọi leaf ở cùng depth → O(log n) guaranteed

2.2 Node Structure trên Disk

Page size = 16KB (MySQL InnoDB default)

┌─────────────────────────────────────────────────────┐
│  Page Header (38 bytes)                             │
├─────────────────────────────────────────────────────┤
│  Record 1: [key=10 | ptr_to_child | extra_info]     │
│  Record 2: [key=20 | ptr_to_child | extra_info]     │
│  Record 3: [key=30 | ptr_to_child | extra_info]     │
├─────────────────────────────────────────────────────┤
│  Page Directory (slot array, binary search)         │
├─────────────────────────────────────────────────────┤
│  Page Trailer (checksum)                            │
└─────────────────────────────────────────────────────┘

Tree height = log_B(N) với B = branching factor
  16KB page, 16-byte key, 8-byte pointer → B ≈ 680
  → 1 triệu records: height ≈ 2-3
  → Chỉ 2-3 disk reads để tìm bất kỳ record nào!

2.3 B-Tree Write — Tại sao tốn kém

Insert bình thường thì chỉ cần write 1 page. Vấn đề xảy ra khi node đầy và phải split:

Node đầy → Split:
  [10|20|30|40|50] → FULL
  Split thành [10|20|30] và [40|50]
  Promote middle key lên parent
  → Parent cũng có thể full → split lan lên
  → Worst case: O(log n) pages phải rewrite

PostgreSQL dùng Copy-on-Write B-Tree: không modify page in-place mà tạo page mới và update pointer. Old pages vẫn tồn tại → snapshot isolation tự nhiên, nhưng cần VACUUM để dọn dẹp.

2.4 WAL — Write-Ahead Log

Đây là cơ chế đảm bảo durability. Vấn đề: nếu đang write B-Tree page mà crash → page bị corrupt. Giải pháp: luôn ghi vào WAL (append-only, sequential) trước, rồi mới apply vào data pages.

Crash recovery:
  → Đọc WAL từ last checkpoint
  → Redo committed transactions
  → Undo uncommitted transactions

  ┌──────────────────────────────────────────┐
  │  WAL File (append-only, sequential)      │
  │  [LSN=1: INSERT row] [LSN=2: UPDATE...]  │
  │  [LSN=3: COMMIT]     [LSN=4: INSERT...]  │
  └──────────────────────────────────────────┘
          ↓ periodically
  ┌──────────────────────────────────────────┐
  │  Data Pages (B-Tree)                     │
  │  (có thể bị dirty, chưa flush)           │
  └──────────────────────────────────────────┘

LSN = Log Sequence Number (monotonically increasing)
Checkpoint = điểm đảm bảo tất cả pages trước đây đã flush

Sequential write vào WAL nhanh hơn random write vào B-Tree pages rất nhiều — đây là lý do WAL không làm chậm write throughput đáng kể.

3. LSM-Tree — Write-optimized

Log-Structured Merge Tree. Dùng trong RocksDB, Cassandra, LevelDB, ScyllaDB.

3.1 Vấn đề B-Tree với write-heavy workload

B-Tree có write amplification cao: insert 1 record có thể kéo theo rewrite nhiều pages khi có node split. Trên HDD, random writes còn tốn thêm seek time. Trên SSD, random writes gây write amplification ở tầng flash, giảm tuổi thọ.

LSM giải quyết bằng một insight đơn giản: biến random writes thành sequential writes.

3.2 Kiến trúc LSM-Tree

WRITE PATH:

Step 1: Write vào WAL (crash safety — giống B-Tree)

Step 2: Write vào MemTable (in-memory, sorted)
  MemTable = Red-Black Tree hoặc Skip List
  → Sorted, in RAM → cực nhanh

Step 3: Khi MemTable đầy (~64MB) → flush to disk as SSTable
  SSTable = Sorted String Table (immutable file)
  Level 0: SSTables mới nhất, có thể overlap key ranges

Step 4: Background compaction — merge SSTables
  Level 1 (~10MB):   [SSTable: 1-100] [SSTable: 101-200]  (no overlap!)
  Level 2 (~100MB):  [1-50] [51-100] [101-150] ...
  Level N:           tăng 10x mỗi level

Lý do LSM nhanh hơn B-Tree cho writes: mọi write đều là append (WAL + MemTable). Không có random I/O. Compaction chạy background và là sequential I/O.

3.3 SSTable Format

SSTable (immutable, sorted, on-disk):

┌──────────────────────────────────────────────────┐
│  Data Blocks (4KB mỗi block)                     │
│  [key1:val1][key2:val2]...[keyN:valN]             │
├──────────────────────────────────────────────────┤
│  Index Block                                     │
│  [key1→offset0] [keyN+1→offset4096] ...          │
├──────────────────────────────────────────────────┤
│  Bloom Filter                                    │
│  [bit array — check key có trong file không]     │
├──────────────────────────────────────────────────┤
│  Metadata & Footer                               │
└──────────────────────────────────────────────────┘

3.4 Bloom Filter — Tại sao quan trọng

LSM read phải check nhiều SSTables ở nhiều levels — nếu phải đọc từng file để tìm key thì rất tốn I/O. Bloom Filter giải quyết vấn đề này.

Bloom Filter là probabilistic data structure dùng ~10 bits/key:

"Key không có trong file": 100% chính xác → skip file hoàn toàn
"Key có trong file": ~1% false positive → phải đọc để verify

Lookup:
  Hash key với k hash functions → kiểm tra k bits
  Nếu bất kỳ bit nào = 0 → CHẮC CHẮN không có → skip SSTable
  Nếu tất cả bits = 1 → CÓ THỂ có → đọc block để verify

→ Bloom filter giảm ~90%+ unnecessary disk reads cho point queries.

3.5 Compaction Strategies

Size-Tiered Compaction (Cassandra default):
  Khi có N SSTables cùng size → merge thành 1 SSTable lớn hơn
  Pro: ít write amplification
  Con: cần 2x space trong quá trình compact (space amplification cao)

Leveled Compaction (RocksDB, LevelDB default):
  L0 → L1 → L2... mỗi level 10x lớn hơn, key ranges không overlap
  Pro: read hiệu quả hơn (chỉ check 1 SSTable/level)
  Con: write amplification cao hơn (WAF ~10-30x)

FIFO Compaction:
  Delete oldest SSTables khi đầy
  Use case: time-series data với TTL

4. B-Tree vs LSM-Tree

                    B-Tree              LSM-Tree
─────────────────────────────────────────────────────────────────
Write              Random I/O          Sequential I/O (nhanh hơn)
Read               O(log n), tốt       Phải check nhiều levels
Write Amplif.      Thấp-TB             Cao (compaction)
Read Amplif.       Thấp                Cao (nhiều SSTables)
Space Amplif.      Thấp                Cao (redundant data)
Update             In-place            Append new version
Delete             In-place            Tombstone marker
Range query        Tốt (leaf linked)   Tốt (sorted)
Use case           OLTP, read-heavy    Write-heavy, time-series

Databases:
  B-Tree:  MySQL (InnoDB), PostgreSQL, Oracle, SQL Server
  LSM:     Cassandra, RocksDB, LevelDB, ScyllaDB, TiKV

5. Indexing — Chi tiết

5.1 Clustered vs Non-Clustered Index

Clustered Index (InnoDB): data rows được lưu trực tiếp trong B-Tree leaf nodes, sắp xếp vật lý theo primary key. Tìm theo PK chỉ cần 1 B-Tree traversal — data nằm ngay tại đó. Mỗi table chỉ có 1 clustered index.

B-Tree Leaf Node:
  [PK=1 | name="Alice" | age=30 | email="a@b.com"]
  [PK=2 | name="Bob"   | age=25 | email="b@c.com"]

Non-Clustered Index (Secondary Index): leaf nodes chỉ chứa [index_key | primary_key]. Tìm row cần 2 bước: tìm PK trong secondary index → dùng PK tìm row trong clustered index. Gọi là "double lookup".

Secondary Index on (age):
  [age=25 | PK=2] → lookup PK=2 trong clustered index
  [age=30 | PK=1] → lookup PK=1

PostgreSQL khác InnoDB: không có clustered index. Data rows nằm trong heap files (unordered), tất cả indexes đều non-clustered và trỏ đến heap TID.

5.2 Composite Index — Thứ tự quan trọng!

Index on (last_name, first_name, age):

Dùng được:
  WHERE last_name = 'Smith'                              ✅ (prefix)
  WHERE last_name = 'Smith' AND first_name = 'John'     ✅
  WHERE last_name = 'Smith' AND age = 30                ✅ (nhưng age không dùng index)
  ORDER BY last_name, first_name                        ✅

Không dùng được:
  WHERE first_name = 'John'                             ❌ (bỏ qua prefix)
  WHERE age = 30                                        ❌
  ORDER BY first_name                                   ❌

Quy tắc Leftmost Prefix: index hữu ích khi query bắt đầu từ column đầu tiên. Thứ tự columns trong composite index ảnh hưởng trực tiếp đến hiệu quả.

Đặt column có cardinality cao (nhiều unique values) lên trước: (user_id, status) tốt hơn (status, user_id) vì status chỉ có vài values, user_id có hàng triệu.

5.3 Index Types

Hash Index: O(1) exact lookup, nhưng không hỗ trợ range queries hay ORDER BY. Chủ yếu dùng cho in-memory hash tables.

Full-Text Index: inverted index ánh xạ từng từ đến danh sách documents chứa từ đó. Hỗ trợ relevance ranking (TF-IDF, BM25).

Partial Index (PostgreSQL): chỉ index rows thỏa điều kiện — index nhỏ hơn, nhanh hơn:

CREATE INDEX idx ON orders(status) WHERE status = 'pending';

Covering Index: index chứa tất cả columns mà query cần → không cần lookup table (index-only scan):

CREATE INDEX idx ON users(email, name, age);
SELECT name, age FROM users WHERE email = 'x@y.com';
-- Chỉ đọc index, không đụng đến table

GiST / GIN Index (PostgreSQL):

GiST: Generalized Search Tree — spatial data, ranges, full-text
GIN: Generalized Inverted Index — arrays, JSONB, full-text
BRIN: Block Range Index — time-series, append-only tables (rất nhỏ)

5.4 Khi nào DB không dùng index?

Biết các trường hợp này để tránh viết query bị full scan:

1. Low selectivity:
   WHERE status IN ('active', 'inactive')
   → Full table scan nhanh hơn nếu phải fetch > ~5-10% rows

2. Function trên column:
   WHERE YEAR(created_at) = 2024         ❌  (index không dùng được)
   WHERE created_at >= '2024-01-01'       ✅  (sử dụng được index)

3. Implicit type conversion:
   WHERE user_id = '123'  (user_id là INT, '123' là string)
   → DB phải convert toàn bộ index → không dùng được

4. Leading wildcard:
   WHERE name LIKE '%Alice'  ❌
   WHERE name LIKE 'Alice%'  ✅

5. OR conditions:
   WHERE a = 1 OR b = 2
   → Có thể cần 2 index scans + merge, đôi khi full scan nhanh hơn