MCP Servers with Code Mode: The missing piece in Agentic AI
A practical look at why MCP tool calling hits scaling limits, and how Code Mode (typed APIs + sandboxed execution) unlocks efficient multi-step agent workflows.
By Aman Kumar Nirala
Jan 01, 2026

The Model Context Protocol, announced in November 2024, standardized how LLMs interact with external systems: you define tools as JSON schemas, the model outputs structured calls, your harness executes them, and results flow back to the model. MCP gives agents a uniform way to discover tools, read their schemas, and invoke them with proper authorization.
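Concretely, that loop looks something like the sketch below; the tool definition and the executeTool stub are illustrative, not a specific MCP client:
// A tool definition, published by an MCP server as a JSON schema.
const tools = [{
  name: 'get_document',
  description: 'Retrieves a document from Google Drive',
  inputSchema: {
    type: 'object',
    properties: { documentId: { type: 'string' } },
    required: ['documentId'],
  },
}];

// Given those schemas, the model emits a structured call...
type ToolCall = { name: string; arguments: Record<string, unknown> };
const call: ToolCall = { name: 'get_document', arguments: { documentId: 'abc123' } };

// ...which the harness dispatches to the matching server (stubbed here) and
// whose result is appended to the conversation for the model's next turn.
async function executeTool(c: ToolCall): Promise<unknown> {
  console.log(`dispatching ${c.name}`);
  return { title: 'Q3 Planning', body: '(document contents)' }; // stand-in for the real round-trip
}
const result = await executeTool(call);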
But MCP’s rapid adoption exposed two scaling bottlenecks:
1. Tool definitions consume context. Each tool schema, with its description, parameters, types, and examples, takes up tokens. Connect an agent to 50 MCP servers with 10 tools each, and you’re loading 500 tool definitions before the model reads your first request.
2. Intermediate results waste tokens. Traditional tool calling routes every result through the model’s context. Fetch a document, filter it, transform it, send it elsewhere: each step burns tokens copying data the model doesn’t need to “think about.”
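To put rough numbers on the first problem: if each definition averages, say, 150 tokens (an assumption; real schemas vary widely), those 500 tools mean roughly 75,000 tokens of context consumed before the agent does any work.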
Cloudflare and Anthropic proposed something different: What if we convert MCP tool schemas into TypeScript interfaces and let the model write code to orchestrate them?

Source: https://blog.cloudflare.com/code-mode
Agents can handle far more tools. When tools are presented as a typed TypeScript API rather than JSON schemas, models navigate them more naturally, perhaps because LLMs have seen billions of lines of real TypeScript in training but only thousands of synthetic tool-calling examples.
Multi-step workflows become dramatically more efficient. With traditional tool calling, every intermediate result flows through the model’s context, even when the model is just copying data from one tool to another. When the model writes code, it can process data in the execution environment and only surface final results. Token savings can reach 95%+ for data-heavy workflows.
Complex orchestration becomes natural. Loops, conditionals, error handling, and retry patterns that require awkward tool-calling chains become straightforward code. The model doesn’t need a “polling tool” or “conditional tool.” It just writes a while loop or an if-statement.
In short: LLMs are better at writing code to call MCP tools than at calling MCP tools directly.
The core insight is simple: LLMs are excellent at generating idiomatic TypeScript. When you ask ChatGPT or Claude to write functions and glue code, they naturally express loops, branches, error-handling, and retry patterns they’ve seen millions of times in open source repositories.
Contrast this with tool calling. The special tokens that represent <|tool_call|> and <|end_tool_call|> don't exist in the wild. They're artifacts of fine-tuning, taught through synthetic training sets that pale in comparison to the corpus of actual code the model has ingested.
Ask Shakespeare to write a play in Mandarin after a month-long crash course and you’ll get something functional but hardly his best work.
Code Mode translates MCP schemas into TypeScript interfaces, complete with doc comments, type signatures, and semantic context. The model sees this:
interface GetDocumentInput {
  documentId: string;
}

/** Read a document from Google Drive */
export async function getDocument(
  input: GetDocumentInput
): Promise<GetDocumentResponse>;
Not this:
{
  "name": "get_document",
  "description": "Retrieves a document from Google Drive",
  "parameters": {
    "documentId": {"type": "string", "required": true}
  }
}
The first is familiar territory. The second is a contrived representation that the model has barely encountered.
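The translation step itself needn't be elaborate. A rough sketch of a schema-to-interface generator, limited to flat parameters with primitive types (not Cloudflare's actual generator), might look like this:
type ToolSchema = {
  name: string;
  description: string;
  parameters: Record<string, { type: string; required?: boolean }>;
};

// Convert snake_case tool names to camelCase function names.
function toCamelCase(name: string): string {
  return name.replace(/_([a-z])/g, (_, c: string) => c.toUpperCase());
}

// Emit a TypeScript input interface plus a documented function signature.
function generateClient(tool: ToolSchema): string {
  const fnName = toCamelCase(tool.name);
  const typeName = fnName[0].toUpperCase() + fnName.slice(1) + 'Input';
  const fields = Object.entries(tool.parameters)
    .map(([key, p]) => `  ${key}${p.required ? '' : '?'}: ${p.type};`)
    .join('\n');
  return [
    `interface ${typeName} {`,
    fields,
    `}`,
    ``,
    `/** ${tool.description} */`,
    `export async function ${fnName}(input: ${typeName}): Promise<unknown>;`,
  ].join('\n');
}
Run over the get_document schema above, this produces roughly the interface shown earlier.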
Code Mode’s real power emerges when agents need to string together multiple operations. Consider a typical workflow: fetch a 10,000-row spreadsheet, filter for pending orders, and update Salesforce records.
With traditional tool calling, every step round-trips through the model’s context:
TOOL CALL: getSheet("abc123")
→ returns 10,000 rows [loaded into context]
TOOL CALL: filter rows where status = "pending"
→ returns 156 rows [loaded into context]
TOOL CALL: updateSalesforce(...)
Each intermediate result consumes tokens. For a two-hour meeting transcript flowing from Google Docs to Salesforce, you might burn 50,000 extra tokens just copying data between tools.
Code Mode collapses this:
const sheet = await gdrive.getSheet({sheetId: 'abc123'});
const pending = sheet.filter(row => row.status === 'pending');

for (const order of pending) {
  await salesforce.updateRecord({
    objectType: 'Order',
    recordId: order.id,
    data: {status: 'Processing'}
  });
}

console.log(`Updated ${pending.length} orders`);
The filtering happens in the sandbox. Only the final log line returns to the model. Token savings approach 98% for data-heavy workflows.
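The host side of that is conceptually simple: run the generated code in an isolated context, expose only the tool bindings you choose, capture console output, and return just that to the model. A minimal sketch using Node's built-in vm module follows; note that vm is not a real security boundary (Cloudflare's implementation uses V8 isolates, and production systems need genuine isolation):
import { createContext, runInContext } from 'node:vm';

// Run model-generated code with only the provided bindings in scope,
// returning captured console output as the sole result.
async function runGeneratedCode(
  code: string,
  bindings: Record<string, unknown>   // e.g. { gdrive, salesforce }
): Promise<string> {
  const logs: string[] = [];
  const sandbox = createContext({
    ...bindings,
    console: { log: (...args: unknown[]) => logs.push(args.join(' ')) },
  });
  // Wrap in an async IIFE so the generated code can use await; cap execution time.
  await runInContext(`(async () => { ${code} })()`, sandbox, { timeout: 5_000 });
  return logs.join('\n');   // only this output goes back into the model's context
}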
Code Mode also unlocks workflow patterns that are just too complicated with individual tool calls:
Polling and retries:
let deploymentComplete = false;
while (!deploymentComplete) {
  const status = await cicd.getDeploymentStatus({id: 'deploy-789'});
  deploymentComplete = status.state === 'success';
  if (!deploymentComplete) {
    await new Promise(resolve => setTimeout(resolve, 5000));
  }
}
Conditional logic:
const userProfile = await db.getUser({id: userId});

if (userProfile.tier === 'premium') {
  await notifications.sendEmail({
    to: userProfile.email,
    template: 'premium-welcome'
  });
} else {
  await notifications.sendSMS({
    to: userProfile.phone,
    message: 'Welcome to our service!'
  });
}
Error handling:
try {
  const report = await analytics.generateReport({dateRange: 'last-week'});
  await slack.postMessage({channel: '#sales', text: report});
} catch (error) {
  await slack.postMessage({
    channel: '#engineering',
    text: `Report generation failed: ${error.message}`
  });
}
With traditional tool calling, you’d need separate “conditional_tool,” “retry_tool,” or “error_handler_tool” abstractions, or you’d risk contextual drift from dumping every intermediate result into the LLM’s context. With code, you just write standard control flow.
Code Mode isn’t free. You’re trading the simplicity of direct tool calls for a more complex execution model.
Infrastructure requirements: You need a generator that converts MCP schemas to typed clients, a hardened sandbox with resource limits and monitoring, and observability for generated code. Cloudflare uses V8 isolates, lightweight and disposable, but most teams will reach for container-based solutions with higher overhead.
Security surface: Agent-generated code is, by definition, untrusted. Your sandbox needs to prohibit arbitrary network access, prevent filesystem escape, and enforce execution timeouts. Bindings help by providing authorized interfaces without exposing API keys, but you’re still running dynamic code from an LLM.
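As an illustration of what a binding can look like (the proxy URL and helper below are hypothetical): the sandboxed code gets a typed function, while the credential and the HTTP call stay on the host.
// Host-side binding: sandboxed code calls gdrive.getSheet(...), but the API key
// and the HTTP details never enter the sandbox (URL and shape are illustrative).
function makeGdriveBinding(apiKey: string) {
  return {
    async getSheet(input: { sheetId: string }): Promise<unknown> {
      const res = await fetch(`https://gdrive-mcp.internal/sheets/${input.sheetId}`, {
        headers: { Authorization: `Bearer ${apiKey}` },
      });
      return res.json();
    },
  };
}

// Passed into the sandbox as `gdrive`; the generated code never sees the key.
const gdrive = makeGdriveBinding(process.env.GDRIVE_API_KEY ?? '');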
Debugging complexity: When a tool call fails, you have JSON and an error message. When code execution fails, you have a stacktrace in a sandbox, potentially spanning multiple async operations. Observability becomes critical but harder to implement.
Operational risk: Code Mode introduces more moving parts. Schema generation can break. Sandboxes can leak resources. Generated code can hit edge cases that your testing missed. Teams accustomed to deterministic tool-call patterns will find Code Mode requires more sophisticated monitoring.
Code Mode is practical and elegant for the right use cases. Adopt it where:
- Workflows chain many tools or process large intermediate results the model doesn’t need to reason about.
- Agents connect to dozens of MCP servers and the tool definitions alone would crowd the context window.
- Orchestration needs loops, conditionals, retries, or error handling that would be awkward as individual tool calls.
Stick with traditional tool calling where:
- Interactions are single calls or short chains with small payloads.
- You can’t justify the sandbox, code-generation, and observability infrastructure.
- Determinism and easy debugging matter more than token efficiency.
The irony is that we’ve built layers of abstraction to teach LLMs what documentation already expresses. MCP schemas get transpiled back into code interfaces because models are better at reading code than JSON. We’ve circumnavigated the globe to reach our neighbor’s house.
But the journey matters. MCP provided standardization. Code Mode provides efficiency. Together, they let agents handle hundreds of tools without drowning in the token cost of loading every MCP schema directly into context.
The real question isn’t whether Code Mode is elegant. It’s whether the operational complexity is worth the token savings and orchestration power for your specific use case. Choose wisely.