Anthropic Turns MCP Agents Into Code-First Systems With the ‘Code Execution with MCP’ Approach
Introduction
In the rapidly evolving world of AI agents, one of the biggest bottlenecks has been context and tool integration: giving large language models (LLMs) meaningful access to external data, services and workflows in a scalable way. With the open-standard Model Context Protocol (MCP), Anthropic introduced a way to standardize how an agent accesses tools and data. Now the company is pushing further: transforming MCP-enabled agents from tool-call orchestrators into code-first systems, by letting agents write and execute code (in a sandbox) rather than relying purely on direct tool calls. This is what they term "Code Execution with MCP".
In this article we’ll explore:
- What is MCP and how did we get here
- The motivation for moving to code-first agents
- How Anthropic's approach works: the architecture and workflow
- Key benefits and trade-offs
- Practical use-cases and implications for developers/enterprises
- Safety, governance and future outlook
What is MCP — the foundation
Before diving into the code-first evolution, it’s important to understand MCP itself.
The challenge of tool integration
Traditional LLM-based agents often face the "integration explosion" problem: each external data source or tool (databases, CRMs, code repos, file systems, monitoring dashboards, etc.) requires custom connector logic and custom prompts, and the model must consume tool definitions and responses as part of its context. This leads to token bloat, latency, brittleness and maintenance overhead.
MCP as the “USB-C port for AI”
Anthropic describes MCP as an open standard protocol that "standardizes how applications provide context to LLMs", just as a USB-C port provides a universal interface to peripherals and accessories.
In simplified terms:
- The MCP host (the agent environment or LLM client) can connect to one or more MCP servers, each of which exposes a set of tools, resources or prompts.
- The host queries the server for available tools/resources, chooses which tool(s) to call, supplies arguments, and gets structured responses back, all in a standardized schema rather than many bespoke "agent → tool" pipelines.
- Because the protocol is model-agnostic and tool-agnostic, it allows simpler scalability and reuse across agents and applications.
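Under the hood, that standardized schema is a set of JSON-RPC 2.0 messages. A simplified sketch of the two requests a host typically sends is shown below; the tool name and arguments are illustrative, not from any specific server.

```typescript
// Simplified JSON-RPC 2.0 messages an MCP host sends to an MCP server.
// "tools/list" discovers the available tools; "tools/call" invokes one of them.
// The tool name and arguments below are illustrative.
const listToolsRequest = {
  jsonrpc: "2.0",
  id: 1,
  method: "tools/list",
  params: {},
};

const callToolRequest = {
  jsonrpc: "2.0",
  id: 2,
  method: "tools/call",
  params: {
    name: "getDocument",
    arguments: { documentId: "abc123" },
  },
};
```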
Early adoption and limitations
MCP has already been adopted widely: developers have built frameworks (e.g., the open-source mcp-agent project on GitHub) to simplify building agents with MCP.
But as adoption grew, some patterns started to reveal inefficiencies:
- Agents needed large catalogs of tool definitions in context, even for tools they didn't use.
- When chaining many tool calls (tool A, then tool B, then tool C), the model often had to pass large intermediate payloads around as plain text in its context.
- Token usage and latency start to explode for complex workflows.
Anthropic itself flagged these as pain points of the classic MCP tool-invocation approach.
The leap: Code Execution with MCP
Recognizing these inefficiencies, Anthropic has pushed a new paradigm: rather than having the agent continuously orchestrate tool calls in language/prompt space, the agent writes code that uses the MCP tools as code APIs and then executes that code in a sandbox. This code-first approach is called Code Execution with MCP.
What does “code-first” mean here?
Instead of:
- Model: "Call tool X with arguments … then call tool Y with the result … then summarise …" (with all intermediate data supplied as context)

Now:

- Model: "Write a TypeScript (or other) file script.ts that imports tools/googleDrive/getDocument and tools/salesforce/updateRecord, executes them, processes the results in code, and then returns a summary."
- Execution environment: the agent's code is executed in a sandbox where these tool wrappers call the MCP servers under the hood, but the large payloads (e.g., a full transcript) never pass through the language model's context.
How the architecture evolves
Anthropic outlines a pattern with three main steps (as described in the MarkTechPost coverage):
1. Generate a directory such as servers/ that mirrors all available MCP servers and tools.
2. Create thin wrapper functions (for example servers/google-drive/getDocument.ts) that call the underlying MCP tool with typed parameters (see the sketch below).
3. Have the agent write code that imports these wrappers, composes them, handles control flow and data movement, and then executes in the sandbox.
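A minimal sketch of what such a thin wrapper might look like, assuming the execution harness exposes a generic callMCPTool helper (the helper name, module layout and tool name are assumptions for illustration):

```typescript
// servers/google-drive/getDocument.ts
// Thin typed wrapper around one MCP tool. callMCPTool is a hypothetical
// helper provided by the execution harness; it forwards the call to the
// Google Drive MCP server and returns the structured result.
import { callMCPTool } from "../../client";

export interface GetDocumentInput {
  documentId: string;
}

export interface GetDocumentResponse {
  content: string; // full document text, kept inside the sandbox
}

export async function getDocument(
  input: GetDocumentInput
): Promise<GetDocumentResponse> {
  return callMCPTool<GetDocumentResponse>("google_drive__get_document", input);
}
```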
Concrete example
From Anthropic’s blog: previously a workflow might look like “fetch transcript from Google Drive via MCP server A → pass transcript to model → model issues a second tool call via MCP server B (Salesforce) to update record → model summarises outcome”.
With code-first, you instead write code that does the fetch, the processing and the update directly.
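A minimal sketch of such a workflow script, assuming generated wrappers under servers/ like the one above (module paths and the Salesforce field name are illustrative):

```typescript
// workflow.ts (agent-generated): the heavy transcript stays inside the sandbox.
import { getDocument } from "./servers/google-drive/getDocument";
import { updateRecord } from "./servers/salesforce/updateRecord";

export async function syncMeetingNotes(documentId: string, recordId: string) {
  // Fetch the full transcript via the Google Drive MCP server.
  const doc = await getDocument({ documentId });

  // Reduce the large payload in ordinary code, e.g. keep a short summary.
  const summary = doc.content.split("\n").slice(0, 10).join("\n");

  // Update the Salesforce record via the second MCP server.
  await updateRecord({ recordId, fields: { Meeting_Notes__c: summary } });

  // Only this short string ever goes back to the model.
  return `Updated ${recordId} with a ${summary.length}-character summary.`;
}
```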
Here, the full transcript (doc.content) never enters the language-model context except as a short summary or reduced form.
Why this shift matters
The code-first approach with MCP brings several compelling benefits — and some challenges.
Benefits
- Massive token and context efficiency: Anthropic reports an example where token usage fell from ~150,000 tokens (in a classical tool-chain workflow) to ~2,000 tokens when rewritten with the code-first approach, a roughly 98.7% reduction. Fewer tokens → lower cost → faster throughput.
- Reduced latency and better tooling: because intermediate payloads (large transcripts, big datasets) don't flow through the LM, bottlenecks shrink. The code layer also allows easier debugging, reuse and modularisation (a proper software-engineering mindset rather than ad-hoc prompts).
- Tool discovery and lazy loading: rather than loading every tool definition into context at once, the filesystem-based interface lets the agent explore directories, import only what's needed, and keep the context window lean. This makes large catalogues of tools feasible without overwhelming the model.
- Better composition and control flow: in code we can handle branching, loops, retries, parallel calls and error-handling more naturally (see the sketch after this list). Agents become more like "code-first programmers" rather than just "prompt and tool call orchestrators".
- Separation of concerns: the "tool definitions" live in code modules, and the "agent logic" is code that uses these modules, so there's a cleaner architecture. The model doesn't have to reason about every tool every time; it just writes the script. This improves maintainability, testing and observability.
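As an illustration of that control-flow point, here is a hedged sketch that adds retries and parallel fetching around the hypothetical wrappers used earlier; in prompt space each of these steps would be another model round-trip.

```typescript
import { getDocument } from "./servers/google-drive/getDocument";
import { updateRecord } from "./servers/salesforce/updateRecord";

// Plain-code retry helper: transient failures are handled inside the sandbox
// instead of being round-tripped through the model's context.
async function withRetry<T>(fn: () => Promise<T>, attempts = 3): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
    }
  }
  throw lastError;
}

// Fetch several documents in parallel, then write one combined update.
export async function syncAll(documentIds: string[], recordId: string) {
  const docs = await Promise.all(
    documentIds.map((id) => withRetry(() => getDocument({ documentId: id })))
  );
  const combined = docs.map((d) => d.content.slice(0, 200)).join("\n---\n");
  await withRetry(() => updateRecord({ recordId, fields: { Notes__c: combined } }));
}
```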
Trade-offs & challenges
- Infrastructure and sandboxing: executing code means you need sandboxed environments, resource limits, secure containerisation and file management. Anthropic notes that code execution "introduces its own complexity" (e.g., sandboxing, monitoring).
- Debugging and monitoring: with code (TypeScript/Python) modules, you now need conventional developer workflows: logging, error tracing, versioning. Agents need support for answering "why did this code fail?" and "what happened in the sandbox?".
- Security risks: allowing models to write and execute code raises new attack surfaces: unintended operations, malicious code generation, uncontrolled tool invocations. Indeed, the broader MCP ecosystem has already been flagged for security concerns.
- Agent competence in code writing: the agent must not only pick which tools to call but also write correct code that composes them. That places additional demands on the LLM's code-generation skill: module typing, async/await correctness, and so on.
- Latency of execution vs. prompt-calling: depending on the sandbox, code execution might add startup time (container spin-up). There can be a trade-off between "just call the tool synchronously via a prompt" and "write code and run it in a sandbox".
Practical implications for developers and enterprises
For engineers, product teams and enterprise AI architects, the "code-first MCP agent" model opens new possibilities.
Developer workflows
- Using Claude Code (Anthropic's agentic coding assistant) with MCP, you can embed tool wrappers directly into your code repo. For example, you could check in ./servers/mcp/* modules, invite the agent to extend workflows, and have the agent generate new scripts autonomously.
- Projects can version-control tool wrappers, offer documented APIs to the agent, and treat agent logic as code (alongside tests, CI, etc.).
- Code reuse: once tool wrappers exist, many workflows can be scripted by the agent.
- Efficiency: developers may spend less time crafting prompts and piping data, and more time supervising, reviewing and refining agent-written scripts.
Enterprise scale & tool-ecosystem
- Enterprises often have internal systems (CRMs, ERPs, monitoring platforms, knowledge bases). With MCP plus code-first agents, you can expose those systems as MCP servers with wrappers and let agents build workflows in code that interact with them.
- The token-cost savings matter at scale: large organisations with many agent workflows will find context-window efficiency critical.
- Governance and observability become easier: code gives you audit logs, version control, role-based access for modules, sandboxed execution logs, and so on.
- Standardisation: by adopting MCP across teams, you reduce the "every team builds a custom tool connector" problem; wrappers become shared modules, and agent logic becomes code that works across tools.
Example use-cases
- Automated bug triage and fixing: the agent fetches an issue from Jira (via an MCP server), reads the source code, writes a fix, runs tests, and commits a PR, all via wrappers and a script.
- Customer support escalations: the agent queries the CRM to fetch customer context, analyses logs, writes an internal summary, and updates the ticket status.
- Regulatory reporting: the agent collects data from databases, runs a transformation script, composes a report, and submits it to the regulator via an API.
In each of these, moving heavy payloads via model context is inefficient; code-first keeps the heavy lifting outside the LLM.
How to implement Code Execution with MCP: a practical guide
Here’s a blueprint for teams looking to adopt this paradigm.
1. Expose your tools as MCP servers
- Build or deploy MCP servers for your data sources and APIs. (You may use open-source frameworks such as mcp-agent, FastMCP, etc.; a sketch follows below.)
- Ensure the servers expose typed tool definitions (inputs/outputs) and resource endpoints.
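As a sketch, a small MCP server exposing one typed tool might look like the following with the official TypeScript SDK; import paths and method names reflect one SDK version and may differ in yours, and the CRM lookup itself is a stub.

```typescript
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { z } from "zod";

// Declare the server and register one typed tool.
const server = new McpServer({ name: "crm", version: "1.0.0" });

server.tool(
  "getCustomer",
  { customerId: z.string() },
  async ({ customerId }) => ({
    // Stubbed result; a real server would query the CRM here.
    content: [
      { type: "text", text: JSON.stringify({ customerId, status: "active" }) },
    ],
  })
);

// Serve over stdio so an MCP host can launch and talk to this process.
const transport = new StdioServerTransport();
await server.connect(transport);
```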
2. Generate wrapper modules
- Create a folder structure in your codebase (e.g., servers/google-drive/, servers/salesforce/).
- For each tool, scaffold a wrapper function (in TypeScript or Python) that calls the MCP endpoint, as in the wrapper sketch earlier.
3. Instruct the agent (LLM) to write scripts
- Ask the agent to write a script (say workflow.ts) that imports the required wrappers, orchestrates the logic, and handles control flow and errors.
- Example prompt: "Using the wrappers in servers/, write a script that fetches the meeting transcript, summarises it, and updates the Salesforce record. Make sure to catch errors and log results."
4. Execute in sandbox
- Provide a sandboxed environment (container, code runner) where the generated script runs safely, with restricted scope (see the sketch below).
- Monitor execution, capture logs, and enforce resource limits and security boundaries.
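One hedged way to approximate this from a Node.js harness is to run the generated script in a locked-down container. The image, mount path and limits below are illustrative, and a real setup would also need controlled network access so the script can still reach the MCP servers.

```typescript
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const run = promisify(execFile);

// Run an agent-generated script in a constrained Docker container:
// no network, read-only filesystem, memory/CPU caps and a hard timeout.
export async function runInSandbox(scriptPath: string) {
  const { stdout, stderr } = await run(
    "docker",
    [
      "run", "--rm",
      "--network", "none",
      "--read-only",
      "--memory", "512m",
      "--cpus", "1",
      "-v", `${process.cwd()}:/workspace:ro`,
      "node:20-slim",
      "node", `/workspace/${scriptPath}`,
    ],
    { timeout: 60_000 }
  );
  return { stdout, stderr };
}
```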
5. Monitor agent outputs, review code
- Treat the agent as a code generator: review its scripts, add tests, integrate into CI.
- Track performance metrics (token usage, latency, success rate) to validate the efficiency gains.
6. Iterate & reuse modules
- Over time, wrap more tools, refine the wrappers, and build higher-level "skills" modules (e.g., ./skills/emailCampaign.ts) for reuse, as sketched below.
- Encourage agents to import modules rather than reinvent logic each time; the code-first architecture supports this.
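A hedged sketch of such a skill module, composed from hypothetical lower-level wrappers (names and paths are illustrative):

```typescript
// skills/emailCampaign.ts
// A reusable "skill" composed from lower-level tool wrappers; the agent can
// import this module instead of re-deriving the same orchestration each time.
import { listContacts } from "../servers/salesforce/listContacts";
import { sendEmail } from "../servers/gmail/sendEmail";

export async function emailCampaign(segment: string, subject: string, body: string) {
  const contacts = await listContacts({ segment });
  let sent = 0;
  for (const contact of contacts) {
    await sendEmail({ to: contact.email, subject, body });
    sent++;
  }
  return { segment, sent };
}
```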
Best practices
- Document each tool wrapper with usage examples so the agent knows how to call it.
- Limit which tool wrappers the agent can use (governance).
- Log and version everything: scripts, tool wrappers, execution results.
- Use sandboxing and resource quotas for safety.
- Maintain a clean separation between tool-wrapper code (infrastructure), agent scripts (logic) and generated code (workflows), so that each layer can be tested, versioned and reviewed.
Why this matters for the future of AI-agents
Scaling agentic workflows
If agents remain constrained to "call tool X → pass result through model → call tool Y", then token limits, latency and context-window size will keep blocking more ambitious workflows (multi-step, cross-tool, heavy-data). The code-first model significantly raises the ceiling on complexity and scale.
Making agents look more like “software engineers”
With code generation and execution, agents cease being just "smart chatbots" and become closer to "software engineers": they write orchestration code, modularise logic, debug, and compose tools. This is a shift from prompt engineering to code engineering plus prompt guidance.
Lower cost and higher velocity
In enterprise settings, reducing token usage by 90% or more (roughly 98.7% in the example above) means lower operational cost, faster turnaround, and better latency. Agents become more viable for mission-critical workflows.
Ecosystem & modularisation
Tool wrappers become shared and agent scripts become reusable; this opens the door to marketplaces of "skills" (module bundles), versioned code templates, and standard libraries of tool wrappers for MCP servers. You could think of an "agent-skills store" built around code modules.
Standards push
Because MCP is an open standard, this code-first model could become a dominant architecture for agent-tool integration. Other providers (e.g., OpenAI) have already adopted MCP in part. Having a code-first standard could set a new precedent.
Risks, governance & change-management
Security risks
Executing arbitrary code via agents inevitably raises risk: sandbox escapes, malicious code injection, tool misuse. Academic audits of the MCP ecosystem have already flagged such risks.
Hence, organisations must adopt strict governance: whitelisting of tool-wrappers, sandbox isolation, audit logs, role-based access, runtime quotas.
Debugging & understanding agent behaviour
As agents generate more complex code, understanding their decision logic, tracing bug sources, and ensuring correctness all become harder. Teams need robust observability.
Maintenance of wrapper modules
Over time, the wrapper-modules become part of your infrastructure and must be maintained, version-controlled, tested, and documented just like any other API library. This may require different skill-sets (DevOps + AI-architect) than earlier prompt-only workflows.
Change-management for teams
Engineers, AI builders and product owners must align on how agent-generated scripts fit into software-engineering pipelines (CI/CD, code review, testing). This is a cultural shift: prompt engineering remains one facet, but code governance becomes equally important.
Ethical & compliance considerations
When agents act via code against enterprise systems (e.g., databases, production services), you must ensure data-privacy, audit trails, accountability (who approved the agent, what did it change), compliance (e.g., financial, regulatory).
Future outlook
Agent tool ecosystems will expand
With the code-first MCP approach, you’ll see “tool-wrapper libraries” proliferate: collections of wrappers for Slack, GitHub, Figma, Asana, Postgres, Salesforce, etc. Agents will browse/import modules as libraries rather than rely on built-in tool lists.
Agents as low-code developers
As the abstraction rises, non-engineer users may specify high-level goals, and agents will generate and run scripts accordingly (with oversight). Code-first means even citizen developers can build workflows by specifying “Write script that …” rather than crafting prompts for each tool.
Inter-agent collaboration and orchestration
Multiple agents could build on shared code modules and orchestrate workflows: one agent writes code, another reviews/validates, another deploys/schedules. MCP + code execution becomes the backbone of multi-agent systems.
Venture of “agent skills marketplace”
Just as we have npm/PyPI for libraries, we may see a marketplace of agent skills (code modules + wrappers + workflows) that teams can plug into their codebase, further accelerating adoption.
Research & benchmarking
New benchmarks will emerge (e.g., how efficiently agents invoke tools, how well they orchestrate multi-step workflows) to evaluate the code-first model. For example, OSWorld-MCP measures tool-invocation rates.
Conclusion
Anthropic's shift from "LLM + tool calls via MCP" to "LLM writes code that calls MCP tools" is a major architectural move. By making agents code-first systems, it tackles one of the key bottlenecks in agentic workflows: token bloat, latency and brittle tool chaining. The reported reduction in token usage (98.7% in one example) points to large operational gains.
For developers and enterprises this means: a more scalable way to build agentic applications, a code-centric mindset (wrappers, modules, sandboxed execution), and greater reuse, maintainability and efficiency. But it also brings new responsibilities: infrastructure for sandboxing, robust governance, code-review for agent-generated scripts, security monitoring, and aligning agents with standard software-engineering practices.
If you are building or planning to build AI agents in 2025/26 and beyond, considering the “code-first + MCP” architecture ought to be high on your agenda.