🤖 AI✍️ Khoa📅 19/04/2026☕ 10 phút đọc

Agentic Systems & Tool Calling Architecture

1. Agent là gì? — Định nghĩa Engineering

Hype xung quanh "AI Agent" rất nhiều nhưng bản chất kỹ thuật cực kỳ đơn giản. Một Agent, xét từ góc độ code, chỉ là:

While task_not_complete:
    action = LLM.decide_next_action(current_state)
    if action == "use_tool":
        result = execute_tool(action.tool_name, action.arguments)
        current_state.append(result)
    elif action == "final_answer":
        return action.content
    else:
        break  # max_iterations reached

Đó. Đó là toàn bộ bí mật của Devin, AutoGPT, hay bất kỳ "AI Agent" nào bạn đọc trên báo. Cái phức tạp nằm ở:

Cách LLM quyết định action nào là đúng (ReAct, Plan-and-Solve, Tree-of-Thought)
Cách thiết kế Tool interface để model dùng đúng
Cách manage State và Memory qua nhiều vòng lặp
Cách xử lý failure, retry, và circuit-breaking

2. Tool Calling / Function Calling — Cơ chế nội tại

Đây là feature được baked-in vào OpenAI API từ GPT-4, Claude API từ Claude 3, và các hosted models khác. Nó không phải ma thuật — về nguyên lý là prompt engineering kết hợp với constrained decoding.

2.1 Cách khai báo Tool

tools = [
    {
        "type": "function",
        "function": {
            "name": "search_database",
            "description": "Search the company database for employee information. "
                          "Use this when user asks about specific employees or HR data.",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {
                        "type": "string",
                        "description": "Search query to find in database"
                    },
                    "table": {
                        "type": "string",
                        "enum": ["employees", "departments", "salaries"],
                        "description": "Which table to search"
                    },
                    "limit": {
                        "type": "integer",
                        "description": "Max results to return",
                        "default": 10
                    }
                },
                "required": ["query", "table"]
            }
        }
    }
]

2.2 Điều gì xảy ra bên trong model?

Request đến LLM:
  System: "You are a helpful assistant. You have access to: [tool schema JSON]"
  User:   "Tìm giúp tôi thông tin nhân viên Nguyễn Văn A"

Model đọc schema → nhận ra có tool "search_database" phù hợp
Model generate output (KHÔNG phải text thường):
  {
    "tool_use": {
      "name": "search_database",
      "arguments": {
        "query": "Nguyễn Văn A",
        "table": "employees"
      }
    }
  }

→ API trả về response với stop_reason = "tool_use"
→ Code của bạn intercept, execute actual function
→ Nhét kết quả vào conversation history
→ Gọi lại LLM với context mới
→ Model đọc kết quả, synthesis → trả lời user

2.3 Constrained Decoding — Tại sao JSON không bị malformed?

Một lo ngại hợp lý: LLM là probabilistic model, sao nó sinh ra JSON chuẩn format? Câu trả lời là qua Grammar-constrained decoding (hay Logit Bias):

Thời điểm model đang sinh phần "table":
  Logit vector trước softmax:
    "employees": 8.5
    "departments": 7.2
    "salaries": 6.8
    "payroll": 6.5    ← nằm ngoài enum!
    "foo":    2.1     ← nằm ngoài enum!

Grammar constraint nhận biết context hiện tại = giá trị của field "table"
→ Mask tất cả tokens không thuộc enum → set logit về -inf
→ Softmax chỉ áp dụng trên {"employees", "departments", "salaries"}
→ Model KHÔNG THỂ generate token ngoài danh sách này

Libraries như outlines, lm-format-enforcer, hay built-in của vLLM implement tính năng này qua Pushdown Automaton — parse JSON schema thành state machine, tại mỗi state chỉ allow transitions hợp lệ.

3. ReAct Pattern — Kiến trúc nền tảng

ReAct (Reasoning + Acting) là paper quan trọng nhất trong lĩnh vực Agents (Yao et al., 2023). Ý tưởng: xen kẽ bước suy nghĩ (Thought) và hành động (Action) thay vì ép model ra answer ngay.

3.1 ReAct Prompt Structure

System prompt:
  "Answer questions by reasoning step by step. 
   Use this format:
   Thought: [your reasoning about what to do next]
   Action: [tool_name]
   Action Input: [input to the tool]
   Observation: [result from tool]
   ... (repeat Thought/Action/Observation as needed)
   Thought: I now have enough information
   Final Answer: [your final answer]"

User: "Dân số Việt Nam nhân với GDP per capita là bao nhiêu?"

Model trace:
  Thought: Tôi cần tìm dân số Việt Nam và GDP per capita trước rồi mới tính được.
  Action: web_search
  Action Input: "Vietnam population 2024"
  Observation: 98.2 million (2024)

  Thought: Đã có dân số. Giờ cần GDP per capita.
  Action: web_search
  Action Input: "Vietnam GDP per capita 2024 USD"
  Observation: $4,347 (2024 estimate)

  Thought: Giờ tôi có đủ data. 98.2M × $4,347 = $426.7 billion.
  Final Answer: Tổng GDP Việt Nam ước tính khoảng $426.7 tỷ USD.

3.2 Vấn đề của ReAct: Hay bị lú (Hallucinating Actions)

ReAct không fail vì thiếu tool — nó fail vì model đôi khi "tưởng tượng" kết quả từ tool thay vì thực sự gọi tool. Dấu hiệu: model tự điền phần "Observation" trước khi bạn inject kết quả thật.

Giải pháp: Forced stopping — parse output của model theo regex, nếu thấy Action: thì dừng generation, gọi tool, inject Observation thực. Không để model tiếp tục sinh thêm gì.

4. Plan-and-Solve — Kiến trúc cho Complex Tasks

ReAct là sequential: quyết định từng bước. Plan-and-Solve tách thành 2 pha rõ ràng:

Phase 1 — Planning (Planner LLM):
  Input: "Viết report phân tích Q4 2024 của công ty, bao gồm revenue, 
          cost breakdown, và comparison với Q3"
  
  Output (structured plan):
  {
    "tasks": [
      {"id": 1, "name": "Fetch Q4 revenue data", "tool": "db_query", 
       "args": {"table": "revenue", "period": "Q4-2024"}},
      {"id": 2, "name": "Fetch Q3 revenue data", "tool": "db_query",
       "args": {"table": "revenue", "period": "Q3-2024"}, "depends_on": []},
      {"id": 3, "name": "Fetch cost breakdown Q4", "tool": "db_query",
       "args": {"table": "costs", "period": "Q4-2024"}},
      {"id": 4, "name": "Generate report", "tool": "llm_generate",
       "args": {"template": "quarterly_report"}, "depends_on": [1,2,3]}
    ]
  }

Phase 2 — Execution (Executor):
  - Chạy task 1, 2, 3 song song (vì không có dependency)
  - Đợi kết quả, chạy task 4
  - Return final report

Điểm mạnh: Task 1, 2, 3 có thể chạy parallel → latency giảm đáng kể so với ReAct sequential. Phù hợp với tasks có nhiều data-fetching steps.

5. Multi-Agent Architecture — Khi 1 Agent không đủ

5.1 Tại sao Multi-Agent?

Single agent với 1 system prompt dài 5000 tokens để xử lý mọi thứ:

Context window bị chiếm nhiều bởi instructions → ít chỗ cho actual data
Model bị confused khi nhiều roles trong 1 prompt ("Bạn vừa là designer vừa là coder vừa là QA...")
Khó debug khi fail

Multi-agent: mỗi agent nhỏ, focused, system prompt ngắn gọn:

flowchart TB
  O["Orchestrator Agent"] --> R["Research Agent"]
  O --> C["Code Agent"]
  O --> K["Critic Agent"]
  R --> RT["Tools: web_search, read_url"]
  C --> CT["Tools: python_repl, write_file"]
  K --> KT["Tools: read_file, run_tests"]

Orchestrator Agent
      │
      ├── Research Agent   (system: "Bạn chỉ search web và trả về summary")
      │         │
      │         └── Tools: [web_search, read_url]
      │
      ├── Code Agent       (system: "Bạn chỉ viết và execute code Python")
      │         │
      │         └── Tools: [python_repl, write_file]
      │
      └── Critic Agent     (system: "Bạn review code và report bugs")
                │
                └── Tools: [read_file, run_tests]

5.2 Communication Protocols

a) Shared Memory (Blackboard Pattern):

Agents giao tiếp qua shared state (Redis / DB):
  Research Agent writes: {"findings": [...], "status": "complete"}
  Code Agent reads: findings, writes code to shared storage
  Critic Agent reads: code, writes review back

Pros: Loose coupling, agents có thể chạy async
Cons: Race conditions, cần careful locking

b) Direct Handoff (Sequential Chain):

Orchestrator → Research Agent → [waits for result] → Code Agent → [waits] → Critic
Pros: Simple, deterministic, easy to debug
Cons: Latency = sum of all steps, no parallelism

c) Event-driven (Message Queue):

Research Agent publish: {topic: "research_done", payload: {...}}
Code Agent subscribe: "research_done" → trigger automatically
Critic Agent subscribe: "code_ready" → trigger

Kiến trúc này scale tốt, nhưng debugging flow phức tạp (distributed tracing required)

5.3 Reflection / Self-Correction Pattern

Đây là pattern đằng sau Aider, SWE-agent, và nhiều coding assistants:

flowchart TB
  U["User Task"] --> A["Actor LLM"]
  A --> G["Generate Code/Plan/Answer"]
  G --> X["Execute in sandbox"]
  X --> D{"SUCCESS / FAILURE"}
  D -->|SUCCESS| RR["Return result"]
  D -->|FAILURE| E["Evaluator LLM"]
  E --> C["Critique & suggest fix"]
  C --> A
  C --> GU["Give up (max N retries)"]

┌────────────────────────────────────────────────────────────┐
│                    Reflection Loop                          │
│                                                             │
│  User Task                                                  │
│     │                                                       │
│     ▼                                                       │
│  [Actor LLM] ──generate──► [Code / Plan / Answer]          │
│                                     │                       │
│                              execute in sandbox             │
│                                     │                       │
│                    ┌────────────────┴───────┐               │
│                    │                        │               │
│                 SUCCESS                   FAILURE           │
│                    │                   (error, test fail)   │
│                    ▼                        │               │
│              Return result             [Evaluator LLM]      │
│                                             │               │
│                                    critique & suggest fix   │
│                                             │               │
│                                    [Actor LLM] retry ───────┤
│                                             │    (max N)    │
│                                        Give up              │
└────────────────────────────────────────────────────────────┘

Key implementation detail: Evaluator không chỉ nhận được error message, nó nhận được full context: original task, code written, exact error traceback, test output. Càng nhiều context → critique càng chính xác → fix rate cao hơn.

6. Memory Architecture — Giải quyết "Context Window Hell"

6.1 Sliding Window Memory (Đơn giản, hay dùng)

MAX_MESSAGES = 20
conversation_history = deque(maxlen=MAX_MESSAGES)

def add_message(role, content):
    conversation_history.append({"role": role, "content": content})
    # Khi deque đầy, message cũ nhất tự bị xóa

# Luôn giữ system prompt ở đầu
messages = [system_prompt] + list(conversation_history)

Vấn đề: Nếu thông tin quan trọng xuất hiện ở message thứ 5 và conversation đã đến message 100, thông tin đó bị forget.

6.2 Summary Memory (Context Compression)

Khi conversation_history vượt N messages:
  1. Chạy LLM summarize các messages cũ:
     "Summarize this conversation history concisely, 
      keeping key decisions, facts, and context."
  2. Thay thế N messages cũ bằng 1 summary message
  3. Tiếp tục conversation với summary + recent messages

Message structure sau compression:
  [System Prompt]
  [Summary: "User is building a RAG system for company HR docs. 
             Decided to use pgvector. Currently implementing chunking."]
  [Recent 10 messages...]

6.3 Long-term Memory với Vector Store

Persistent memory qua nhiều sessions:

class AgentMemory:
    def __init__(self):
        self.vector_store = ChromaDB()
        self.embedding_model = ...

    def save(self, content: str, metadata: dict):
        """Lưu memory item vào long-term store"""
        embedding = self.embedding_model.encode(content)
        self.vector_store.upsert(
            embedding=embedding,
            text=content,
            metadata={**metadata, "timestamp": time.now()}
        )

    def recall(self, query: str, top_k: int = 5) -> list[str]:
        """Retrieve relevant memories cho current context"""
        query_embedding = self.embedding_model.encode(query)
        results = self.vector_store.search(query_embedding, top_k=top_k)
        return [r.text for r in results]

# Usage trong Agent:
relevant_memories = memory.recall(current_user_query)
system_prompt += f"\n\nRelevant context from past interactions:\n{relevant_memories}"

Đây chính xác là cơ chế agent memory của Claude kết hợp với project memory — lưu observation quan trọng vào file-based store, load lại ở đầu session mới.

7. Failure Modes và Production Concerns

7.1 Infinite Loops

MAX_ITERATIONS = 10  # LUÔN có hard limit
iteration = 0

while not done:
    if iteration >= MAX_ITERATIONS:
        return "Agent reached maximum iterations. Task incomplete."
    
    action = llm.decide(state)
    state = execute(action)
    iteration += 1
    done = check_completion(state)

7.2 Cost Control

# Track token usage per agent run
total_tokens = 0
MAX_TOKENS_PER_RUN = 50_000  # → ~$1.5 với GPT-4

response = llm.call(messages, tools=tools)
total_tokens += response.usage.total_tokens

if total_tokens > MAX_TOKENS_PER_RUN:
    raise AgentBudgetExceededException(f"Used {total_tokens} tokens")

7.3 Tool Execution Safety

Cực kỳ quan trọng — không bao giờ để model gọi tool với destructive operations mà không có human confirmation:

DANGEROUS_TOOLS = ["delete_file", "drop_table", "send_email_bulk"]

def execute_tool(tool_name: str, args: dict) -> str:
    if tool_name in DANGEROUS_TOOLS:
        # Pause và yêu cầu human confirmation
        confirmed = request_human_approval(tool_name, args)
        if not confirmed:
            return "Action cancelled by user"
    
    # Sandbox execution cho code tools
    if tool_name == "python_repl":
        return run_in_docker_sandbox(args["code"], timeout=30)
    
    return registry[tool_name](**args)

7.4 Observability cho Agents

Agent runs cần được trace đầy đủ để debug:

{
  "run_id": "agent_run_abc123",
  "start_time": "2024-01-15T10:30:00Z",
  "iterations": [
    {
      "iteration": 1,
      "thought": "I need to search for...",
      "action": "web_search",
      "action_input": {"query": "..."},
      "observation": "...",
      "tokens_used": 1250,
      "latency_ms": 850
    }
  ],
  "total_tokens": 8420,
  "total_cost_usd": 0.087,
  "final_answer": "...",
  "status": "completed"
}

LangSmith, LangFuse, hay Weights & Biases đều support agent tracing. Tự build cũng không khó — chỉ cần log mỗi iteration vào DB.