🗄️ Database✍️ Khoa📅 19/04/2026☕ 4 phút đọc

Database: Interview và Big Picture

14. Câu hỏi phỏng vấn Senior hay gặp

Q: B-Tree vs LSM-Tree — chọn khi nào?

B-Tree tốt cho read-heavy workloads và mixed workloads vì read performance tốt, low read amplification. LSM tốt cho write-heavy workloads (time-series, logs, event streams) vì sequential writes nhanh hơn random writes. Ngoài ra LSM có space amplification cao hơn do compaction. Ví dụ: MySQL/PostgreSQL (OLTP, balanced) vs Cassandra/RocksDB (write-heavy, append-heavy).

Q: Tại sao MVCC giúp "writers không block readers"?

Với locking thuần túy, writer đặt exclusive lock → reader phải chờ. Với MVCC, writer tạo version mới của row thay vì modify in-place. Reader đọc version cũ (theo snapshot của mình), không cần chờ writer hoàn thành. Writer không cần biết reader đang đọc version cũ. Kết quả: throughput cao hơn, latency thấp hơn, đặc biệt với read-heavy workloads.

Q: Repeatable Read có ngăn Phantom Read không?

Phụ thuộc vào implementation. Theo SQL standard: Repeatable Read KHÔNG ngăn Phantom Read. MySQL InnoDB: ngăn được nhờ next-key locking (gap locks ngăn INSERT vào khoảng đang được query). PostgreSQL: dùng SSI (Serializable Snapshot Isolation) approach, Repeatable Read thực chất là snapshot isolation — ngăn phantom trong most cases nhưng không phải tất cả (write skew vẫn có thể xảy ra).

Q: Tại sao index có thể làm chậm writes?

Mỗi INSERT/UPDATE/DELETE phải update tất cả indexes trên table. Với 5 indexes, 1 INSERT thực ra phải write 6 lần (1 data + 5 indexes). Ngoài ra B-Tree page splits khi insert vào full node tốn kém. Đó là lý do: drop indexes trước khi bulk insert, rồi rebuild sau.

Q: Shard key chọn như thế nào?

Tiêu chí: (1) Cardinality cao — đủ nhiều unique values để distribute đều; (2) Write distribution — không tập trung vào 1 shard (tránh monotonically increasing key như auto-increment); (3) Query locality — các queries liên quan nên đụng ít shards; (4) Immutability — shard key không nên thay đổi (đổi = phải move data). Ví dụ: user_id tốt hơn created_at (time-based hotspot), tenant_id tốt cho multi-tenant SaaS.

Q: Phân biệt Isolation Level trong PostgreSQL vs MySQL?

MySQL InnoDB default: Repeatable Read — dùng consistent read view + gap locks. PostgreSQL default: Read Committed — mỗi statement thấy snapshot riêng. PostgreSQL Repeatable Read thực chất là Snapshot Isolation (tốt hơn SQL standard). PostgreSQL Serializable dùng SSI (Serializable Snapshot Isolation) — optimistic, detect conflicts at commit time, ít blocking hơn 2PL. MySQL Serializable dùng 2PL (strict locking).

Q: Tại sao VACUUM quan trọng trong PostgreSQL?

PostgreSQL MVCC giữ old row versions trong table (dead tuples). Không VACUUM → table bloat (physical size phình to), index bloat, query chậm hơn (scan nhiều dead tuples). Nguy hiểm nhất: Transaction ID Wraparound — PostgreSQL dùng 32-bit txn_id, wraparound ~2 tỉ transactions. Nếu không freeze old txn_ids kịp → PostgreSQL tự chặn writes để tránh data corruption (emergency autovacuum).

Q: 2PC vs Saga — dùng gì trong microservices?

2PC: strong consistency, nhưng blocking protocol, coordinator SPOF, performance thấp khi cross nhiều services. Thực tế: chỉ dùng trong cùng 1 DB cluster hoặc XA transactions (hiếm). Saga: eventual consistency, không blocking, mỗi service independent. Trade-off: phải design compensating transactions, khó debug, không có isolation (các service khác có thể thấy intermediate state). Saga phù hợp hơn cho microservices architecture. Để xử lý isolation: semantic locks, pessimistic ordering, reread values.

🗺️ Bức tranh tổng thể

Application
    │
    ▼ SQL / ORM
Query Layer: Parser → Optimizer (statistics, join reorder) → Executor
    │
    ▼ Physical Plan
Storage Engine
    ├── B-Tree (read-optimized, in-place update)
    │     └── WAL (crash recovery, replication)
    └── LSM-Tree (write-optimized, append-only)
          └── MemTable → SSTable → Compaction
    │
    ▼ Concurrency Control
    ├── MVCC: multiple row versions (readers ≠ block writers)
    │     ├── PostgreSQL: dead tuples in heap → VACUUM
    │     └── InnoDB: undo log chain
    └── Locking: Record / Gap / Next-Key locks
    │
    ▼ Transactions
    ├── ACID: WAL (durability) + undo log (atomicity) + locks/MVCC (isolation)
    └── Isolation Levels: Read Committed → Serializable
    │
    ▼ Scale Out
    ├── Replication: async/sync, WAL shipping, logical
    │     └── Failover: Patroni, Orchestrator
    └── Sharding: range / hash / consistent hashing
          └── Cross-shard: 2PC (strong) or Saga (eventual)

Tài liệu dành cho Senior Backend / Database Engineer Interview Preparation Covers: B-Tree · LSM-Tree · Indexing · Query Planner · ACID · Isolation Levels · MVCC · Replication · Sharding · Distributed Transactions