Data Engineering Fundamentals — What Backend Engineers Need to Know
"Data is the new oil": true, but crude oil is useless. Data engineering turns crude data into refined insights. And if you are a Staff Engineer who doesn't understand data pipelines, you have a blind spot covering a third of every architecture discussion.
You don't need to become a Data Engineer. But you do need enough knowledge to collaborate effectively with the data team, design systems that produce the right data, and understand the trade-offs when a data pipeline becomes part of the architecture.
1. OLTP vs OLAP — Two Different Worlds
OLTP (Online Transaction Processing):
→ What you use every day: PostgreSQL, MySQL, MongoDB
→ Optimized for: single-row operations (INSERT, UPDATE, SELECT by ID)
→ Pattern: many small transactions, low latency
→ Schema: normalized (3NF), avoids redundancy
→ Examples: Order service, User service, Payment service
OLAP (Online Analytical Processing):
→ ClickHouse, BigQuery, Redshift, Snowflake
→ Optimized for: aggregate queries (SUM, COUNT, GROUP BY over millions of rows)
→ Pattern: few queries, but each one scans MILLIONS of rows
→ Schema: denormalized (star/snowflake schema)
→ Example: "revenue by month, by region, by product category"
Combining the two:
OLTP (write) ──── ETL/CDC ────→ OLAP (read/analyze)
Production DB                   Data Warehouse
2. Data Pipeline Patterns
2.1 Batch Processing
When to use: the data doesn't need to be real-time; hour-level or day-level freshness is good enough.
┌──────────┐     ┌──────────┐     ┌───────────┐     ┌───────────┐
│  Source  │────→│ Extract  │────→│ Transform │────→│   Load    │
│ (DB/API) │     │          │     │           │     │ (DW/Lake) │
└──────────┘     └──────────┘     └───────────┘     └───────────┘
              ETL pipeline (runs hourly/daily)
Tools: Apache Spark, Apache Beam, dbt, Airflow (orchestration)
Pros:
→ Simple, easy to debug (just rerun the failed job)
→ Cost-effective (can run on spot instances)
→ Handles very large data volumes (TB+ per batch)
Cons:
→ High latency (hours to a day)
→ The "stale data" problem
2.2 Stream Processing
When to use: the data needs to be near-real-time; second-level or minute-level latency.
┌──────────┐     ┌──────────┐     ┌───────────┐     ┌───────────┐
│  Source  │────→│ Message  │────→│  Stream   │────→│   Sink    │
│ (Events) │     │  Broker  │     │ Processor │     │(DB/Alert) │
└──────────┘     └──────────┘     └───────────┘     └───────────┘
                 Kafka/Pulsar     Flink/KStreams
Tools:
Message Broker: Apache Kafka, Apache Pulsar, AWS Kinesis
Stream Processor: Apache Flink, Kafka Streams, Spark Streaming
Pros:
→ Near real-time (seconds of latency)
→ Continuous processing, no "batch window"
→ A natural fit for event-driven architecture
Cons:
→ Complex: state management, exactly-once semantics, ordering
→ Harder to debug (there is no "rerun this batch")
→ Infrastructure overhead
2.3 Lambda Architecture
Lambda = batch and streaming running in parallel:
                  ┌── Batch Layer ──→ Batch Views ──────┐
Source ──→ Queue ─┤                                     ├──→ Serving Layer
                  └── Speed Layer ──→ Real-time Views ──┘
Batch: accurate but slow (runs every hour)
Speed: fast but approximate (stream processing)
Serving: merges both, preferring the batch result when available
Pros: accuracy + speed
Cons: maintaining 2 pipelines = 2x the complexity
In practice: many teams have moved to the Kappa Architecture instead
2.4 Kappa Architecture
Kappa = streaming only, no batch layer:
Source ──→ Kafka ──→ Stream Processor ──→ Serving Layer
"If you need to reprocess: replay the Kafka topic from the beginning."
Pros: one pipeline, simpler than Lambda
Cons: Kafka retention must be long enough, and replays take time
When to use: event-driven systems, log processing
3. Change Data Capture (CDC) — Bridge OLTP ↔ OLAP
CDC = capturing changes from a database and streaming them out:
PostgreSQL ──(WAL)──→ Debezium ──→ Kafka ──→ DW/Search/Cache
Why CDC (instead of querying the DB directly):
→ No extra load on the production DB (it reads the WAL, not the tables)
→ Near real-time (seconds, not hours)
→ Captures INSERT + UPDATE + DELETE (full history)
→ Decouples systems: downstream consumers stay independent
Tools:
Debezium: Open source, Kafka Connect based
AWS DMS: Managed, AWS ecosystem
Fivetran: SaaS, 300+ connectors
Example use cases:
→ Sync PostgreSQL → Elasticsearch (search)
→ Sync orders DB → analytics warehouse
→ Sync user changes → cache invalidation
→ Event sourcing from existing DB
4. Data Storage Tiers
4.1 Data Warehouse vs Data Lake vs Lakehouse
Data Warehouse (Snowflake, BigQuery, Redshift):
→ Structured data only (schema-on-write)
→ SQL interface
→ Optimized for BI queries
→ Expensive per TB
→ "A department store: organized, expensive, everything has its place"
Data Lake (S3 + Athena, GCS + BigQuery):
→ Structured + unstructured (schema-on-read)
→ Raw data dump (JSON, Parquet, images, logs)
→ Cheap storage (S3: $0.023/GB/month)
→ Risk: becomes a "data swamp" without governance
→ "A warehouse: cheap, dump everything in, but good luck finding anything"
Data Lakehouse (Delta Lake, Apache Iceberg, Apache Hudi):
→ Best of both: cheap storage + ACID transactions
→ Schema enforcement + evolution
→ Time travel (query data at past point in time)
→ Unified batch + streaming
→ "IKEA: organized warehouse with good catalog"
4.2 File Formats
CSV: Human-readable, slow, no types, no compression
JSON: Flexible, verbose, slow for analytics
Parquet: Columnar, compressed, fast aggregations ✅
ORC: Like Parquet, Hive ecosystem
Avro: Row-based, schema evolution, good for streaming
Rule of thumb:
→ Streaming/messaging: Avro or Protobuf
→ Analytics/warehouse: Parquet
→ Interchange: JSON
→ Never: CSV (except Excel exports for the business team 😅)
5. Data Quality & Schema Evolution
5.1 Data Contracts
Problem: the backend team renames a field → the analytics pipeline breaks.
A data contract is an agreement between producer and consumer:
→ Schema definition (field names, types, constraints)
→ SLA (freshness, completeness, accuracy)
→ Change process (versioning, deprecation notice)
→ Owner & contact
Example (protobuf-style):
message OrderEvent {
  string order_id     = 1;  // required, UUID format
  string user_id      = 2;  // required
  int64  amount_cents = 3;  // required, positive
  string currency     = 4;  // required, ISO 4217
  google.protobuf.Timestamp created_at = 5;  // required
  // Deprecated: use amount_cents instead
  // double amount = 6 [deprecated = true];
}
5.2 Schema Registry
Schema Registry (e.g. Confluent Schema Registry):
→ Central store for schemas (Avro, Protobuf, JSON Schema)
→ Compatibility checks: BACKWARD, FORWARD, FULL
→ Versioned schemas
→ Producers register schemas → consumers validate against them
Compatibility modes:
BACKWARD: consumers on the new schema can read old data
→ OK: add an optional field, remove a field
→ NOT OK: add a required field
FORWARD: consumers on the old schema can read new data
→ OK: add a field, remove an optional field
FULL: both backward and forward compatible
→ Safest, but most restrictive
6. Practical: Backend Engineer × Data Engineering
6.1 Designing Events for Analytics
// ❌ Bad: the event carries too little information
type OrderEvent struct {
	OrderID string `json:"order_id"`
	Status  string `json:"status"`
}
// ✅ Good: the event is rich in context for analytics
type OrderEvent struct {
	// Identity
	EventID   string    `json:"event_id"`   // idempotency key
	EventType string    `json:"event_type"` // e.g. order.created
	EventTime time.Time `json:"event_time"` // when it happened

	// Entity
	OrderID    string `json:"order_id"`
	UserID     string `json:"user_id"`
	MerchantID string `json:"merchant_id"`

	// Business data
	TotalCents int64  `json:"total_cents"`
	Currency   string `json:"currency"`
	ItemCount  int    `json:"item_count"`

	// Context (gold for analytics)
	Channel    string `json:"channel"`     // web, mobile, api
	Region     string `json:"region"`      // VN, SG, US
	DeviceType string `json:"device_type"` // ios, android, desktop

	// Metadata
	SchemaVersion string `json:"schema_version"` // v2
	ProducerID    string `json:"producer_id"`    // order-service
}
6.2 When to Involve the Data Team
Involve them early when:
→ Changing a database schema (data model change)
→ Adding or removing events (downstream pipelines depend on them)
→ Changing business logic in a way that affects metrics
→ Running a migration that could break the CDC pipeline
→ Building a new feature that needs analytics tracking
How to collaborate well:
→ Share the data contract before implementing
→ Coordinate schema changes (with a deprecation period)
→ Run migrations together (DB + pipeline)
→ Share on-call duty for data incidents
7. Summary
Data engineering for backend engineers:
1. OLTP ≠ OLAP: don't run analytics on the production DB
2. CDC: a near-real-time bridge between systems (Debezium + Kafka)
3. Batch is fine for most cases: don't over-engineer with streaming
4. Parquet > JSON > CSV for analytics data
5. Data contracts: agree on the schema BEFORE coding
6. Schema evolution: backward-compatible changes only
7. Enrich events: add context for downstream analytics
References
- Designing Data-Intensive Applications — Kleppmann (chapters 10-12)
- Fundamentals of Data Engineering — Reis & Housley
- Debezium Documentation
- Apache Kafka: The Definitive Guide
- Delta Lake
💡 Remember: you don't need to build data pipelines. You need to build systems that data pipelines can easily consume. Designing good events is a gift to the data team. 🎁