Data Engineering Fundamentals — What Backend Engineers Need to Know
"Data is the new oil": true, but crude oil is useless. Data engineering turns crude data into refined insights. And if you are a Staff Engineer who doesn't understand data pipelines, you have a blind spot covering a third of every architecture discussion.
You don't need to become a Data Engineer. But you do need enough knowledge to collaborate effectively with the data team, design systems that produce the right data, and understand the trade-offs when a data pipeline becomes part of the architecture.
1. OLTP vs OLAP — Two Different Worlds
OLTP (Online Transaction Processing):
→ What you use every day: PostgreSQL, MySQL, MongoDB
→ Optimized for: single-row operations (INSERT, UPDATE, SELECT by ID)
→ Pattern: many small transactions, low latency
→ Schema: normalized (3NF), avoids redundancy
→ Examples: Order service, User service, Payment service
OLAP (Online Analytical Processing):
→ ClickHouse, BigQuery, Redshift, Snowflake
→ Optimized for: aggregate queries (SUM, COUNT, GROUP BY over millions of rows)
→ Pattern: few queries, but each one scans MILLIONS of rows
→ Schema: denormalized (star/snowflake schema)
→ Example: "revenue by month, by region, by product category"
Combining the two:
OLTP (write) ──── ETL/CDC ────→ OLAP (read/analyze)
Production DB                   Data Warehouse
2. Data Pipeline Patterns
2.1 Batch Processing
When to use: the data doesn't need to be real-time; hour-level or day-level freshness is good enough.
┌──────────┐     ┌──────────┐     ┌───────────┐     ┌───────────┐
│  Source  │────→│ Extract  │────→│ Transform │────→│   Load    │
│ (DB/API) │     │          │     │           │     │ (DW/Lake) │
└──────────┘     └──────────┘     └───────────┘     └───────────┘
              ETL pipeline (runs hourly/daily)
Tools: Apache Spark, Apache Beam, dbt, Airflow (orchestration)
Pros:
→ Simple, easy to debug (just rerun the failed job)
→ Cost-effective (can run on spot instances)
→ Handles very large data volumes (TB+ per batch)
Cons:
→ High latency (hours to a day)
→ The "stale data" problem
2.2 Stream Processing
When to use: the data needs to be near-real-time; second-level or minute-level latency.
┌──────────┐     ┌──────────┐     ┌───────────┐     ┌───────────┐
│  Source  │────→│ Message  │────→│  Stream   │────→│   Sink    │
│ (Events) │     │  Broker  │     │ Processor │     │(DB/Alert) │
└──────────┘     └──────────┘     └───────────┘     └───────────┘
                 Kafka/Pulsar     Flink/KStreams
Tools:
Message Broker: Apache Kafka, Apache Pulsar, AWS Kinesis
Stream Processor: Apache Flink, Kafka Streams, Spark Streaming
Pros:
→ Near real-time (seconds of latency)
→ Continuous processing, no "batch window"
→ A natural fit for event-driven architecture
Cons:
→ Complex: state management, exactly-once semantics, ordering
→ Harder to debug (there is no "rerun this batch")
→ Infrastructure overhead
2.3 Lambda Architecture
Lambda = batch and streaming running in parallel:
                  ┌── Batch Layer ──→ Batch Views ──────┐
Source ──→ Queue ─┤                                     ├──→ Serving Layer
                  └── Speed Layer ──→ Real-time Views ──┘
Batch: accurate but slow (runs every hour)
Speed: fast but approximate (stream processing)
Serving: merges both, preferring the batch result when available
Pros: accuracy + speed
Cons: maintaining 2 pipelines = 2x the complexity
In practice: many teams have moved to the Kappa Architecture instead
2.4 Kappa Architecture
Kappa = streaming only, no batch layer:
Source ──→ Kafka ──→ Stream Processor ──→ Serving Layer
"If you need to reprocess: replay the Kafka topic from the beginning."
Pros: one pipeline, simpler than Lambda
Cons: Kafka retention must be long enough, and replays take time
When to use: event-driven systems, log processing
3. Change Data Capture (CDC) — Bridge OLTP ↔ OLAP
CDC = capturing changes from a database and streaming them out:
PostgreSQL ──(WAL)──→ Debezium ──→ Kafka ──→ DW/Search/Cache
Why CDC (instead of querying the DB directly):
→ No extra load on the production DB (it reads the WAL, not the tables)
→ Near real-time (seconds, not hours)
→ Captures INSERT + UPDATE + DELETE (full history)
→ Decouples systems: downstream consumers stay independent
Tools:
Debezium: Open source, Kafka Connect based
AWS DMS: Managed, AWS ecosystem
Fivetran: SaaS, 300+ connectors
Example use cases:
→ Sync PostgreSQL → Elasticsearch (search)
→ Sync orders DB → analytics warehouse
→ Sync user changes → cache invalidation
→ Event sourcing from existing DB
4. Data Storage Tiers
4.1 Data Warehouse vs Data Lake vs Lakehouse
Data Warehouse (Snowflake, BigQuery, Redshift):
→ Structured data only (schema-on-write)
→ SQL interface
→ Optimized for BI queries
→ Expensive per TB
→ "A department store: organized, expensive, everything has its place"
Data Lake (S3 + Athena, GCS + BigQuery):
→ Structured + unstructured (schema-on-read)
→ Raw data dump (JSON, Parquet, images, logs)
→ Cheap storage (S3: $0.023/GB/month)
→ Risk: becomes a "data swamp" without governance
→ "A warehouse: cheap, dump everything in, but good luck finding anything"
Data Lakehouse (Delta Lake, Apache Iceberg, Apache Hudi):
→ Best of both: cheap storage + ACID transactions
→ Schema enforcement + evolution
→ Time travel (query data at past point in time)
→ Unified batch + streaming
→ "IKEA: organized warehouse with good catalog"
4.2 File Formats
CSV: Human-readable, slow, no types, no compression
JSON: Flexible, verbose, slow for analytics
Parquet: Columnar, compressed, fast aggregations ✅
ORC: Like Parquet, Hive ecosystem
Avro: Row-based, schema evolution, good for streaming
Rule of thumb:
→ Streaming/messaging: Avro or Protobuf
→ Analytics/warehouse: Parquet
→ Interchange: JSON
→ Never: CSV (except Excel exports for the business team 😅)
5. Data Quality & Schema Evolution
5.1 Data Contracts
Problem: the backend team renames a field → the analytics pipeline breaks.
A data contract is an agreement between producer and consumer:
→ Schema definition (field names, types, constraints)
→ SLA (freshness, completeness, accuracy)
→ Change process (versioning, deprecation notice)
→ Owner & contact
Example (protobuf-style):
message OrderEvent {
  string order_id     = 1;  // required, UUID format
  string user_id      = 2;  // required
  int64  amount_cents = 3;  // required, positive
  string currency     = 4;  // required, ISO 4217
  google.protobuf.Timestamp created_at = 5;  // required
  // Deprecated: use amount_cents instead
  // double amount = 6 [deprecated = true];
}
5.2 Schema Registry
Schema Registry (e.g. Confluent Schema Registry):
→ Central store for schemas (Avro, Protobuf, JSON Schema)
→ Compatibility checks: BACKWARD, FORWARD, FULL
→ Versioned schemas
→ Producers register schemas → consumers validate against them
Compatibility modes:
BACKWARD: consumers on the new schema can read old data
→ OK: add an optional field, remove a field
→ NOT OK: add a required field
FORWARD: consumers on the old schema can read new data
→ OK: add a field, remove an optional field
FULL: both backward and forward compatible
→ Safest, but most restrictive
6. Practical: Backend Engineer × Data Engineering
6.1 Designing Events for Analytics
// ❌ Bad: the event carries too little information
type OrderEvent struct {
	OrderID string `json:"order_id"`
	Status  string `json:"status"`
}
// ✅ Good: the event is rich in context for analytics
type OrderEvent struct {
	// Identity
	EventID   string    `json:"event_id"`   // idempotency key
	EventType string    `json:"event_type"` // e.g. order.created
	EventTime time.Time `json:"event_time"` // when it happened

	// Entity
	OrderID    string `json:"order_id"`
	UserID     string `json:"user_id"`
	MerchantID string `json:"merchant_id"`

	// Business data
	TotalCents int64  `json:"total_cents"`
	Currency   string `json:"currency"`
	ItemCount  int    `json:"item_count"`

	// Context (gold for analytics)
	Channel    string `json:"channel"`     // web, mobile, api
	Region     string `json:"region"`      // VN, SG, US
	DeviceType string `json:"device_type"` // ios, android, desktop

	// Metadata
	SchemaVersion string `json:"schema_version"` // v2
	ProducerID    string `json:"producer_id"`    // order-service
}
6.2 When to Involve the Data Team
Involve them early when:
→ Changing a database schema (data model change)
→ Adding or removing events (downstream pipelines depend on them)
→ Changing business logic in a way that affects metrics
→ Running a migration that could break the CDC pipeline
→ Building a new feature that needs analytics tracking
How to collaborate well:
→ Share the data contract before implementing
→ Coordinate schema changes (with a deprecation period)
→ Run migrations together (DB + pipeline)
→ Share on-call duty for data incidents
7. Summary
Data engineering for backend engineers:
1. OLTP ≠ OLAP: don't run analytics on the production DB
2. CDC: a near-real-time bridge between systems (Debezium + Kafka)
3. Batch is fine for most cases: don't over-engineer with streaming
4. Parquet > JSON > CSV for analytics data
5. Data contracts: agree on the schema BEFORE coding
6. Schema evolution: backward-compatible changes only
7. Enrich events: add context for downstream analytics
References
- Designing Data-Intensive Applications — Kleppmann (chapters 10-12)
- Fundamentals of Data Engineering — Reis & Housley
- Debezium Documentation
- Apache Kafka: The Definitive Guide
- Delta Lake
💡 Remember: you don't need to build data pipelines. You need to build systems that data pipelines can easily consume. Designing good events is a gift to the data team. 🎁