MCP Servers with Code Mode: The missing piece in Agentic AI
A practical look at why MCP tool calling hits scaling limits, and how Code Mode (typed APIs + sandboxed execution) unlocks efficient multi-step agent workflows.
By Aman Kumar Nirala
Jan 01, 2026

The Model Context Protocol, announced in November 2024, standardized how LLMs interact with external systems: you define tools as JSON schemas, the model outputs structured calls, your harness executes them, and results flow back to the model. MCP gives agents a uniform way to discover tools, read their schemas, and invoke them with proper authorization.
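Concretely, that loop looks something like the sketch below; the tool definition and the executeTool stub are illustrative, not a specific MCP client:
// A tool definition, published by an MCP server as a JSON schema.
const tools = [{
  name: 'get_document',
  description: 'Retrieves a document from Google Drive',
  inputSchema: {
    type: 'object',
    properties: { documentId: { type: 'string' } },
    required: ['documentId'],
  },
}];

// Given those schemas, the model emits a structured call...
type ToolCall = { name: string; arguments: Record<string, unknown> };
const call: ToolCall = { name: 'get_document', arguments: { documentId: 'abc123' } };

// ...which the harness dispatches to the matching server (stubbed here) and
// whose result is appended to the conversation for the model's next turn.
async function executeTool(c: ToolCall): Promise<unknown> {
  console.log(`dispatching ${c.name}`);
  return { title: 'Q3 Planning', body: '(document contents)' }; // stand-in for the real round-trip
}
const result = await executeTool(call);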
But MCP’s rapid adoption exposed two scaling bottlenecks:
1. Tool definitions consume context. Each tool schema, with its description, parameters, types, and examples, takes up tokens. Connect an agent to 50 MCP servers with 10 tools each, and you’re loading 500 tool definitions before the model reads your first request.
2. Intermediate results waste tokens. Traditional tool calling routes every result through the model’s context. Fetch a document, filter it, transform it, send it elsewhere: each step burns tokens copying data the model doesn’t need to “think about.”
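To put rough numbers on the first problem: if each definition averages, say, 150 tokens (an assumption; real schemas vary widely), those 500 tools mean roughly 75,000 tokens of context consumed before the agent does any work.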
Cloudflare and Anthropic proposed something different: What if we convert MCP tool schemas into TypeScript interfaces and let the model write code to orchestrate them?

Source: https://blog.cloudflare.com/code-mode
Agents can handle far more tools. When tools are presented as a typed TypeScript API rather than JSON schemas, models navigate them more naturally, perhaps because LLMs have seen billions of lines of real TypeScript in training but only thousands of synthetic tool-calling examples.
Multi-step workflows become dramatically more efficient. With traditional tool calling, every intermediate result flows through the model’s context, even when the model is just copying data from one tool to another. When the model writes code, it can process data in the execution environment and only surface final results. Token savings can reach 95%+ for data-heavy workflows.
Complex orchestration becomes natural. Loops, conditionals, error handling, and retry patterns that require awkward tool-calling chains become straightforward code. The model doesn’t need a “polling tool” or “conditional tool.” It just writes a while loop or an if-statement.
In short: LLMs are better at writing code to call MCP tools than at calling MCP tools directly.
The core insight is simple: LLMs are excellent at generating idiomatic TypeScript. When you ask ChatGPT or Claude to write functions and glue code, they naturally express loops, branches, error-handling, and retry patterns they’ve seen millions of times in open source repositories.
Contrast this with tool calling. The special tokens that represent <|tool_call|> and <|end_tool_call|> don't exist in the wild. They're artifacts of fine-tuning, taught through synthetic training sets that pale in comparison to the corpus of actual code the model has ingested.
Ask Shakespeare to write a play in Mandarin after a month-long crash course and you’ll get something functional but hardly his best work.
Code Mode translates MCP schemas into TypeScript interfaces, complete with doc comments, type signatures, and semantic context. The model sees this:
interface GetDocumentInput {
  documentId: string;
}

/** Read a document from Google Drive */
export async function getDocument(
  input: GetDocumentInput
): Promise<GetDocumentResponse>;
Not this:
{
  "name": "get_document",
  "description": "Retrieves a document from Google Drive",
  "parameters": {
    "documentId": {"type": "string", "required": true}
  }
}
The first is familiar territory. The second is a contrived representation that the model has barely encountered.
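The translation step itself needn't be elaborate. A rough sketch of a schema-to-interface generator, limited to flat parameters with primitive types (not Cloudflare's actual generator), might look like this:
type ToolSchema = {
  name: string;
  description: string;
  parameters: Record<string, { type: string; required?: boolean }>;
};

// Convert snake_case tool names to camelCase function names.
function toCamelCase(name: string): string {
  return name.replace(/_([a-z])/g, (_, c: string) => c.toUpperCase());
}

// Emit a TypeScript input interface plus a documented function signature.
function generateClient(tool: ToolSchema): string {
  const fnName = toCamelCase(tool.name);
  const typeName = fnName[0].toUpperCase() + fnName.slice(1) + 'Input';
  const fields = Object.entries(tool.parameters)
    .map(([key, p]) => `  ${key}${p.required ? '' : '?'}: ${p.type};`)
    .join('\n');
  return [
    `interface ${typeName} {`,
    fields,
    `}`,
    ``,
    `/** ${tool.description} */`,
    `export async function ${fnName}(input: ${typeName}): Promise<unknown>;`,
  ].join('\n');
}
Run over the get_document schema above, this produces roughly the interface shown earlier.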
Code Mode’s real power emerges when agents need to string together multiple operations. Consider a typical workflow: fetch a 10,000-row spreadsheet, filter for pending orders, and update Salesforce records.
With traditional tool calling, every step round-trips through the model’s context:
TOOL CALL: getSheet("abc123")
→ returns 10,000 rows [loaded into context]
TOOL CALL: filter rows where status = "pending"
→ returns 156 rows [loaded into context]
TOOL CALL: updateSalesforce(...)
Each intermediate result consumes tokens. For a two-hour meeting transcript flowing from Google Docs to Salesforce, you might burn 50,000 extra tokens just copying data between tools.
Code Mode collapses this:
const sheet = await gdrive.getSheet({sheetId: 'abc123'});
const pending = sheet.filter(row => row.status === 'pending');

for (const order of pending) {
  await salesforce.updateRecord({
    objectType: 'Order',
    recordId: order.id,
    data: {status: 'Processing'}
  });
}

console.log(`Updated ${pending.length} orders`);
The filtering happens in the sandbox. Only the final log line returns to the model. Token savings approach 98% for data-heavy workflows.
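The host side of that is conceptually simple: run the generated code in an isolated context, expose only the tool bindings you choose, capture console output, and return just that to the model. A minimal sketch using Node's built-in vm module follows; note that vm is not a real security boundary (Cloudflare's implementation uses V8 isolates, and production systems need genuine isolation):
import { createContext, runInContext } from 'node:vm';

// Run model-generated code with only the provided bindings in scope,
// returning captured console output as the sole result.
async function runGeneratedCode(
  code: string,
  bindings: Record<string, unknown>   // e.g. { gdrive, salesforce }
): Promise<string> {
  const logs: string[] = [];
  const sandbox = createContext({
    ...bindings,
    console: { log: (...args: unknown[]) => logs.push(args.join(' ')) },
  });
  // Wrap in an async IIFE so the generated code can use await; cap execution time.
  await runInContext(`(async () => { ${code} })()`, sandbox, { timeout: 5_000 });
  return logs.join('\n');   // only this output goes back into the model's context
}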
Code Mode also unlocks workflow patterns that are just too complicated with individual tool calls:
Polling and retries:
let deploymentComplete = false;
while (!deploymentComplete) {
  const status = await cicd.getDeploymentStatus({id: 'deploy-789'});
  deploymentComplete = status.state === 'success';
  if (!deploymentComplete) {
    await new Promise(resolve => setTimeout(resolve, 5000));
  }
}
Conditional logic:
const userProfile = await db.getUser({id: userId});

if (userProfile.tier === 'premium') {
  await notifications.sendEmail({
    to: userProfile.email,
    template: 'premium-welcome'
  });
} else {
  await notifications.sendSMS({
    to: userProfile.phone,
    message: 'Welcome to our service!'
  });
}
Error handling:
try {
  const report = await analytics.generateReport({dateRange: 'last-week'});
  await slack.postMessage({channel: '#sales', text: report});
} catch (error) {
  await slack.postMessage({
    channel: '#engineering',
    text: `Report generation failed: ${error.message}`
  });
}
With traditional tool calling, you’d need separate “conditional_tool,” “retry_tool,” or “error_handler_tool” abstractions, or you’d risk contextual drift from dumping every intermediate result into the LLM’s context. With code, you just write standard control flow.
Code Mode isn’t free. You’re trading the simplicity of direct tool calls for a more complex execution model.
Infrastructure requirements: You need a generator that converts MCP schemas to typed clients, a hardened sandbox with resource limits and monitoring, and observability for generated code. Cloudflare uses V8 isolates, lightweight and disposable, but most teams will reach for container-based solutions with higher overhead.
Security surface: Agent-generated code is, by definition, untrusted. Your sandbox needs to prohibit arbitrary network access, prevent filesystem escape, and enforce execution timeouts. Bindings help by providing authorized interfaces without exposing API keys, but you’re still running dynamic code from an LLM.
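As an illustration of what a binding can look like (the proxy URL and helper below are hypothetical): the sandboxed code gets a typed function, while the credential and the HTTP call stay on the host.
// Host-side binding: sandboxed code calls gdrive.getSheet(...), but the API key
// and the HTTP details never enter the sandbox (URL and shape are illustrative).
function makeGdriveBinding(apiKey: string) {
  return {
    async getSheet(input: { sheetId: string }): Promise<unknown> {
      const res = await fetch(`https://gdrive-mcp.internal/sheets/${input.sheetId}`, {
        headers: { Authorization: `Bearer ${apiKey}` },
      });
      return res.json();
    },
  };
}

// Passed into the sandbox as `gdrive`; the generated code never sees the key.
const gdrive = makeGdriveBinding(process.env.GDRIVE_API_KEY ?? '');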
Debugging complexity: When a tool call fails, you have JSON and an error message. When code execution fails, you have a stacktrace in a sandbox, potentially spanning multiple async operations. Observability becomes critical but harder to implement.
Operational risk: Code Mode introduces more moving parts. Schema generation can break. Sandboxes can leak resources. Generated code can hit edge cases that your testing missed. Teams accustomed to deterministic tool-call patterns will find Code Mode requires more sophisticated monitoring.
Code Mode is practical and elegant for the right use cases. Adopt it where:
- Workflows chain many tools or process large intermediate results the model doesn’t need to reason about.
- Agents connect to dozens of MCP servers and the tool definitions alone would crowd the context window.
- Orchestration needs loops, conditionals, retries, or error handling that would be awkward as individual tool calls.
Stick with traditional tool calling where:
- Interactions are single calls or short chains with small payloads.
- You can’t justify the sandbox, code-generation, and observability infrastructure.
- Determinism and easy debugging matter more than token efficiency.
The irony is that we’ve built layers of abstraction to teach LLMs what documentation already expresses. MCP schemas get transpiled back into code interfaces because models are better at reading code than JSON. We’ve circumnavigated the globe to reach our neighbor’s house.
But the journey matters. MCP provided standardization. Code Mode provides efficiency. Together, they let agents handle hundreds of tools without drowning in the token cost of loading every MCP schema directly into context.
The real question isn’t whether Code Mode is elegant. It’s whether the operational complexity is worth the token savings and orchestration power for your specific use case. Choose wisely.