Anthropic Turns MCP Agents Into Code-First Systems With the ‘Code Execution with MCP’ Approach
Introduction
In the rapidly evolving world of AI agents, one of the biggest bottlenecks has been context and tool integration: giving large language models (LLMs) meaningful access to external data, services and workflows in a scalable way. With the open-standard Model Context Protocol (MCP), Anthropic introduced a way to standardize how an agent accesses tools and data. Now the company is pushing further: transforming MCP-enabled agents from tool-call orchestrators into code-first systems, by letting agents write and execute code (in a sandbox) rather than relying purely on direct tool calls. This is what they term "Code Execution with MCP".
In this article we’ll explore:
- What is MCP and how did we get here
- The motivation for moving to code-first agents
- How Anthropic's approach works: the architecture and workflow
- Key benefits and trade-offs
- Practical use-cases and implications for developers/enterprises
- Safety, governance and future outlook
What is MCP — the foundation
Before diving into the code-first evolution, it’s important to understand MCP itself.
The challenge of tool integration
Traditional LLM-based agents often face the "integration explosion" problem: each external data source or tool (databases, CRMs, code repos, file systems, monitoring dashboards, etc.) requires custom connector logic and custom prompts, and the model must consume tool definitions and responses as part of its context. This leads to token bloat, latency, brittleness and maintenance overhead.
MCP as the “USB-C port for AI”
Anthropic describes MCP as an open standard protocol that "standardizes how applications provide context to LLMs", just as a USB-C port provides a universal interface to peripherals and accessories.
In simplified terms:
- The MCP host (the agent environment or LLM client) can connect to one or more MCP servers, each of which exposes a set of tools, resources or prompts.
- The host queries the server for available tools/resources, chooses which tool(s) to call, supplies arguments, and gets structured responses back, all in a standardized schema rather than many bespoke "agent → tool" pipelines.
- Because the protocol is model-agnostic and tool-agnostic, it allows simpler scalability and reuse across agents and applications.
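Under the hood, that standardized schema is a set of JSON-RPC 2.0 messages. A simplified sketch of the two requests a host typically sends is shown below; the tool name and arguments are illustrative, not from any specific server.

```typescript
// Simplified JSON-RPC 2.0 messages an MCP host sends to an MCP server.
// "tools/list" discovers the available tools; "tools/call" invokes one of them.
// The tool name and arguments below are illustrative.
const listToolsRequest = {
  jsonrpc: "2.0",
  id: 1,
  method: "tools/list",
  params: {},
};

const callToolRequest = {
  jsonrpc: "2.0",
  id: 2,
  method: "tools/call",
  params: {
    name: "getDocument",
    arguments: { documentId: "abc123" },
  },
};
```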
Early adoption and limitations
MCP has already been adopted widely: developers have built frameworks (e.g., the open-source mcp-agent project on GitHub) to simplify building agents with MCP.
But as adoption grew, some patterns started to reveal inefficiencies:
- Agents needed large catalogs of tool definitions in context, even for tools they didn't use.
- When chaining many tool calls (tool A, then tool B, then tool C), the model often had to pass large intermediate payloads around as plain text in its context.
- Token usage and latency start to explode for complex workflows.
Anthropic itself flagged these as pain points of the classic MCP tool-invocation approach.
The leap: Code Execution with MCP
Recognizing these inefficiencies, Anthropic has pushed a new paradigm: rather than having the agent continuously orchestrate tool calls in language/prompt space, the agent writes code that uses the MCP tools as code APIs and then executes that code in a sandbox. This code-first approach is called Code Execution with MCP.
What does “code-first” mean here?
Instead of:
- Model: "Call tool X with arguments … then call tool Y with the result … then summarise …" (with all intermediate data supplied as context)

Now:

- Model: "Write a TypeScript (or other) file script.ts that imports tools/googleDrive/getDocument and tools/salesforce/updateRecord, executes them, processes the results in code, and then returns a summary."
- Execution environment: the agent's code is executed in a sandbox where these tool wrappers call the MCP servers under the hood, but the large payloads (e.g., a full transcript) never pass through the language model's context.
How the architecture evolves
Anthropic outlines a pattern with three main steps (as described in the MarkTechPost coverage):
1. Generate a directory such as servers/ that mirrors all available MCP servers and tools.
2. Create thin wrapper functions (for example servers/google-drive/getDocument.ts) that call the underlying MCP tool with typed parameters (see the sketch below).
3. Have the agent write code that imports these wrappers, composes them, handles control flow and data movement, and then executes in the sandbox.
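A minimal sketch of what such a thin wrapper might look like, assuming the execution harness exposes a generic callMCPTool helper (the helper name, module layout and tool name are assumptions for illustration):

```typescript
// servers/google-drive/getDocument.ts
// Thin typed wrapper around one MCP tool. callMCPTool is a hypothetical
// helper provided by the execution harness; it forwards the call to the
// Google Drive MCP server and returns the structured result.
import { callMCPTool } from "../../client";

export interface GetDocumentInput {
  documentId: string;
}

export interface GetDocumentResponse {
  content: string; // full document text, kept inside the sandbox
}

export async function getDocument(
  input: GetDocumentInput
): Promise<GetDocumentResponse> {
  return callMCPTool<GetDocumentResponse>("google_drive__get_document", input);
}
```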
Concrete example
From Anthropic’s blog: previously a workflow might look like “fetch transcript from Google Drive via MCP server A → pass transcript to model → model issues a second tool call via MCP server B (Salesforce) to update record → model summarises outcome”.
With code-first, you instead write code that does the fetch, the processing and the update directly.
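A minimal sketch of such a workflow script, assuming generated wrappers under servers/ like the one above (module paths and the Salesforce field name are illustrative):

```typescript
// workflow.ts (agent-generated): the heavy transcript stays inside the sandbox.
import { getDocument } from "./servers/google-drive/getDocument";
import { updateRecord } from "./servers/salesforce/updateRecord";

export async function syncMeetingNotes(documentId: string, recordId: string) {
  // Fetch the full transcript via the Google Drive MCP server.
  const doc = await getDocument({ documentId });

  // Reduce the large payload in ordinary code, e.g. keep a short summary.
  const summary = doc.content.split("\n").slice(0, 10).join("\n");

  // Update the Salesforce record via the second MCP server.
  await updateRecord({ recordId, fields: { Meeting_Notes__c: summary } });

  // Only this short string ever goes back to the model.
  return `Updated ${recordId} with a ${summary.length}-character summary.`;
}
```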
Here, the full transcript (doc.content) never enters the language-model context except as a short summary or reduced form.
Why this shift matters
The code-first approach with MCP brings several compelling benefits — and some challenges.
Benefits
- Massive token and context efficiency: Anthropic reports an example where token usage fell from ~150,000 tokens (in a classical tool-chain workflow) to ~2,000 tokens when rewritten with the code-first approach, a roughly 98.7% reduction. Fewer tokens → lower cost → faster throughput.
- Reduced latency and better tooling: because intermediate payloads (large transcripts, big datasets) don't flow through the LM, bottlenecks shrink. The code layer also allows easier debugging, reuse and modularisation (a proper software-engineering mindset rather than ad-hoc prompts).
- Tool discovery and lazy loading: rather than loading every tool definition into context at once, the filesystem-based interface lets the agent explore directories, import only what's needed, and keep the context window lean. This makes large catalogues of tools feasible without overwhelming the model.
- Better composition and control flow: in code we can handle branching, loops, retries, parallel calls and error-handling more naturally (see the sketch after this list). Agents become more like "code-first programmers" rather than just "prompt and tool call orchestrators".
- Separation of concerns: the "tool definitions" live in code modules, and the "agent logic" is code that uses these modules, so there's a cleaner architecture. The model doesn't have to reason about every tool every time; it just writes the script. This improves maintainability, testing and observability.
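As an illustration of that control-flow point, here is a hedged sketch that adds retries and parallel fetching around the hypothetical wrappers used earlier; in prompt space each of these steps would be another model round-trip.

```typescript
import { getDocument } from "./servers/google-drive/getDocument";
import { updateRecord } from "./servers/salesforce/updateRecord";

// Plain-code retry helper: transient failures are handled inside the sandbox
// instead of being round-tripped through the model's context.
async function withRetry<T>(fn: () => Promise<T>, attempts = 3): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
    }
  }
  throw lastError;
}

// Fetch several documents in parallel, then write one combined update.
export async function syncAll(documentIds: string[], recordId: string) {
  const docs = await Promise.all(
    documentIds.map((id) => withRetry(() => getDocument({ documentId: id })))
  );
  const combined = docs.map((d) => d.content.slice(0, 200)).join("\n---\n");
  await withRetry(() => updateRecord({ recordId, fields: { Notes__c: combined } }));
}
```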
Trade-offs & challenges
- Infrastructure and sandboxing: executing code means you need sandboxed environments, resource limits, secure containerisation and file management. Anthropic notes that code execution "introduces its own complexity" (e.g., sandboxing, monitoring).
- Debugging and monitoring: with code (TypeScript/Python) modules, you now need conventional developer workflows: logging, error tracing, versioning. Agents need support for answering "why did this code fail?" and "what happened in the sandbox?".
- Security risks: allowing models to write and execute code raises new attack surfaces: unintended operations, malicious code generation, uncontrolled tool invocations. Indeed, the broader MCP ecosystem has already been flagged for security concerns.
- Agent competence in code writing: the agent must not only pick which tools to call but also write correct code that composes them. That places additional demands on the LLM's code-generation skill: module typing, async/await correctness, and so on.
- Latency of execution vs. prompt-calling: depending on the sandbox, code execution might add startup time (container spin-up). There can be a trade-off between "just call the tool synchronously via a prompt" and "write code and run it in a sandbox".
Practical implications for developers and enterprises
For engineers, product teams and enterprise AI architects, the "code-first MCP agent" model opens new possibilities.
Developer workflows
- Using Claude Code (Anthropic's agentic coding assistant) with MCP, you can embed tool wrappers directly into your code repo. For example, you could check in ./servers/mcp/* modules, invite the agent to extend workflows, and have the agent generate new scripts autonomously.
- Projects can version-control tool wrappers, offer documented APIs to the agent, and treat agent logic as code (alongside tests, CI, etc.).
- Code reuse: once tool wrappers exist, many workflows can be scripted by the agent.
- Efficiency: developers may spend less time crafting prompts and piping data, and more time supervising, reviewing and refining agent-written scripts.
Enterprise scale & tool-ecosystem
- Enterprises often have internal systems (CRMs, ERPs, monitoring platforms, knowledge bases). With MCP plus code-first agents, you can expose those systems as MCP servers with wrappers and let agents build workflows in code that interact with them.
- The token-cost savings matter at scale: large organisations with many agent workflows will find context-window efficiency critical.
- Governance and observability become easier: code gives you audit logs, version control, role-based access for modules, sandboxed execution logs, and so on.
- Standardisation: by adopting MCP across teams, you reduce the "every team builds a custom tool connector" problem; wrappers become shared modules, and agent logic becomes code that works across tools.
Example use-cases
- Automated bug triage and fixing: the agent fetches an issue from Jira (via an MCP server), reads the source code, writes a fix, runs tests, and commits a PR, all via wrappers and a script.
- Customer support escalations: the agent queries the CRM to fetch customer context, analyses logs, writes an internal summary, and updates the ticket status.
- Regulatory reporting: the agent collects data from databases, runs a transformation script, composes a report, and submits it to the regulator via an API.
In each of these, moving heavy payloads via model context is inefficient; code-first keeps the heavy lifting outside the LLM.
How to implement Code Execution with MCP: a practical guide
Here’s a blueprint for teams looking to adopt this paradigm.
1. Expose your tools as MCP servers
- Build or deploy MCP servers for your data sources and APIs. (You may use open-source frameworks such as mcp-agent, FastMCP, etc.; a sketch follows below.)
- Ensure the servers expose typed tool definitions (inputs/outputs) and resource endpoints.
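As a sketch, a small MCP server exposing one typed tool might look like the following with the official TypeScript SDK; import paths and method names reflect one SDK version and may differ in yours, and the CRM lookup itself is a stub.

```typescript
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { z } from "zod";

// Declare the server and register one typed tool.
const server = new McpServer({ name: "crm", version: "1.0.0" });

server.tool(
  "getCustomer",
  { customerId: z.string() },
  async ({ customerId }) => ({
    // Stubbed result; a real server would query the CRM here.
    content: [
      { type: "text", text: JSON.stringify({ customerId, status: "active" }) },
    ],
  })
);

// Serve over stdio so an MCP host can launch and talk to this process.
const transport = new StdioServerTransport();
await server.connect(transport);
```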
2. Generate wrapper modules
- Create a folder structure in your codebase (e.g., servers/google-drive/, servers/salesforce/).
- For each tool, scaffold a wrapper function (in TypeScript or Python) that calls the MCP endpoint, as in the wrapper sketch earlier.
3. Instruct the agent (LLM) to write scripts
- Ask the agent to write a script (say workflow.ts) that imports the required wrappers, orchestrates the logic, and handles control flow and errors.
- Example prompt: "Using the wrappers in servers/, write a script that fetches the meeting transcript, summarises it, and updates the Salesforce record. Make sure to catch errors and log results."
4. Execute in sandbox
- Provide a sandboxed environment (container, code runner) where the generated script runs safely, with restricted scope (see the sketch below).
- Monitor execution, capture logs, and enforce resource limits and security boundaries.
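One hedged way to approximate this from a Node.js harness is to run the generated script in a locked-down container. The image, mount path and limits below are illustrative, and a real setup would also need controlled network access so the script can still reach the MCP servers.

```typescript
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const run = promisify(execFile);

// Run an agent-generated script in a constrained Docker container:
// no network, read-only filesystem, memory/CPU caps and a hard timeout.
export async function runInSandbox(scriptPath: string) {
  const { stdout, stderr } = await run(
    "docker",
    [
      "run", "--rm",
      "--network", "none",
      "--read-only",
      "--memory", "512m",
      "--cpus", "1",
      "-v", `${process.cwd()}:/workspace:ro`,
      "node:20-slim",
      "node", `/workspace/${scriptPath}`,
    ],
    { timeout: 60_000 }
  );
  return { stdout, stderr };
}
```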
5. Monitor agent outputs, review code
- Treat the agent as a code generator: review its scripts, add tests, integrate into CI.
- Track performance metrics (token usage, latency, success rate) to validate the efficiency gains.
6. Iterate & reuse modules
- Over time, wrap more tools, refine the wrappers, and build higher-level "skills" modules (e.g., ./skills/emailCampaign.ts) for reuse, as sketched below.
- Encourage agents to import modules rather than reinvent logic each time; the code-first architecture supports this.
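A hedged sketch of such a skill module, composed from hypothetical lower-level wrappers (names and paths are illustrative):

```typescript
// skills/emailCampaign.ts
// A reusable "skill" composed from lower-level tool wrappers; the agent can
// import this module instead of re-deriving the same orchestration each time.
import { listContacts } from "../servers/salesforce/listContacts";
import { sendEmail } from "../servers/gmail/sendEmail";

export async function emailCampaign(segment: string, subject: string, body: string) {
  const contacts = await listContacts({ segment });
  let sent = 0;
  for (const contact of contacts) {
    await sendEmail({ to: contact.email, subject, body });
    sent++;
  }
  return { segment, sent };
}
```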
Best practices
- Document each tool wrapper with usage examples so the agent knows how to call it.
- Limit which tool wrappers the agent can use (governance).
- Log and version everything: scripts, tool wrappers, execution results.
- Use sandboxing and resource quotas for safety.
- Maintain a clean separation between tool-wrapper code (infrastructure), agent scripts (logic) and generated code (workflows), so that each layer can be tested, versioned and reviewed.
Why this matters for the future of AI-agents
Scaling agentic workflows
If agents remain constrained to "call tool X → pass result through model → call tool Y", then token limits, latency and context-window size will keep blocking more ambitious workflows (multi-step, cross-tool, heavy-data). The code-first model significantly raises the ceiling on complexity and scale.
Making agents look more like “software engineers”
With code generation and execution, agents cease being just "smart chatbots" and become closer to "software engineers": they write orchestration code, modularise logic, debug, and compose tools. This is a shift from prompt engineering to code engineering plus prompt guidance.
Lower cost and higher velocity
In enterprise settings, reducing token usage by 90% or more (roughly 98.7% in the example above) means lower operational cost, faster turnaround, and better latency. Agents become more viable for mission-critical workflows.
Ecosystem & modularisation
Tool wrappers become shared and agent scripts become reusable; this opens the door to marketplaces of "skills" (module bundles), versioned code templates, and standard libraries of tool wrappers for MCP servers. You could think of an "agent-skills store" built around code modules.
Standards push
Because MCP is an open standard, this code-first model could become a dominant architecture for agent-tool integration. Other providers (e.g., OpenAI) have already adopted MCP in part. Having a code-first standard could set a new precedent.
Risks, governance & change-management
Security risks
Executing arbitrary code via agents inevitably raises risk: sandbox escapes, malicious code injection, tool misuse. Academic audits of the MCP ecosystem have already flagged such risks.
Hence, organisations must adopt strict governance: whitelisting of tool-wrappers, sandbox isolation, audit logs, role-based access, runtime quotas.
Debugging & understanding agent behaviour
As agents generate more complex code, understanding their decision logic, tracing bug sources, and ensuring correctness all become harder. Teams need robust observability.
Maintenance of wrapper modules
Over time, the wrapper-modules become part of your infrastructure and must be maintained, version-controlled, tested, and documented just like any other API library. This may require different skill-sets (DevOps + AI-architect) than earlier prompt-only workflows.
Change-management for teams
Engineers, AI builders and product owners must align on how agent-generated scripts fit into software-engineering pipelines (CI/CD, code review, testing). This is a cultural shift: prompt engineering remains one facet, but code governance becomes equally important.
Ethical & compliance considerations
When agents act via code against enterprise systems (e.g., databases, production services), you must ensure data-privacy, audit trails, accountability (who approved the agent, what did it change), compliance (e.g., financial, regulatory).
Future outlook
Agent tool ecosystems will expand
With the code-first MCP approach, you’ll see “tool-wrapper libraries” proliferate: collections of wrappers for Slack, GitHub, Figma, Asana, Postgres, Salesforce, etc. Agents will browse/import modules as libraries rather than rely on built-in tool lists.
Agents as low-code developers
As the abstraction rises, non-engineer users may specify high-level goals, and agents will generate and run scripts accordingly (with oversight). Code-first means even citizen developers can build workflows by specifying “Write script that …” rather than crafting prompts for each tool.
Inter-agent collaboration and orchestration
Multiple agents could build on shared code modules and orchestrate workflows: one agent writes code, another reviews/validates, another deploys/schedules. MCP + code execution becomes the backbone of multi-agent systems.
Venture of “agent skills marketplace”
Just as we have npm/PyPI for libraries, we may see a marketplace of agent skills (code modules + wrappers + workflows) that teams can plug into their codebase, further accelerating adoption.
Research & benchmarking
New benchmarks will emerge (e.g., how efficiently agents invoke tools, how well they orchestrate multi-step workflows) to evaluate the code-first model. For example, OSWorld-MCP measures tool-invocation rates.
Conclusion
Anthropic's shift from "LLM + tool calls via MCP" to "LLM writes code that calls MCP tools" is a major architectural move. By making agents code-first systems, it tackles one of the key bottlenecks in agentic workflows: token bloat, latency and brittle tool chaining. The reported reduction in token usage (98.7% in one example) points to large operational gains.
For developers and enterprises this means: a more scalable way to build agentic applications, a code-centric mindset (wrappers, modules, sandboxed execution), and greater reuse, maintainability and efficiency. But it also brings new responsibilities: infrastructure for sandboxing, robust governance, code-review for agent-generated scripts, security monitoring, and aligning agents with standard software-engineering practices.
If you are building or planning to build AI agents in 2025/26 and beyond, considering the “code-first + MCP” architecture ought to be high on your agenda.