    Tool Masking: The Layer MCP Forgot



    By Frank Wittkampf & Lucas Vieira

    MCP and similar services have been a breakthrough in AI connectivity¹: an enormous leap forward when we need to expose services to an LLM quickly and almost effortlessly. But therein also lies the problem: this is bottom-up thinking. “Hey, why don’t we expose everything, everywhere, all at once?”

    Raw exposure of APIs comes at a cost: every tool surface pushed straight into an agent bloats prompts, inflates choice entropy, and drags down execution quality. A well-designed AI agent starts use-case down rather than tech-up. If you were designing your LLM call from scratch, you would never present the full unfiltered surface of an API. You’d be adding unnecessary tokens, unrelated information, extra failure modes, and generally degraded quality. Empirically, broad tool definitions consume large token budgets: e.g., one 28-parameter tool ≈1,633 tokens; 37 tools ≈6,218 tokens, which degrades accuracy and increases latency and cost⁶.

    In our work building enterprise-scale AI solutions for the largest tech companies (MSFT, AWS, Databricks, and many others), where we send millions of tokens a minute to our AI providers, these nuances matter. If you optimize tool exposure, you optimize your LLM execution context, which means you’re improving quality, accuracy, consistency, cost, and latency, all at the same time.

    This article defines the novel concept of tool masking. Many people will already have implicitly experimented with this, but it’s a topic not well explored in online publications so far. Tool masking is an essential, and missing, layer in the current agentic stack. A tool mask shapes what the model actually sees, both before and after execution, so your AI agent isn’t just connected but actually enabled.

    So, rounding out our intro: using raw MCP pollutes your LLM execution. How do you optimize the model-facing surface of a tool for a given agent or task? You use tool masking. A simple idea, but as always, the devil is in the details.

    What MCP does well, and what it doesn’t

    MCP gets a lot right. It’s an open protocol, and Anthropic refers to it as the “USB-C for AI”: a way to connect LLM apps with external tools and data without friction¹. It nails the fundamentals: standardizing how tools, resources, and prompts are described, discovered, and invoked, whether you’re using JSON-RPC over stdio or streaming over HTTP². Auth is handled cleanly at the transport layer³. That’s why you see it landing everywhere from OpenAI’s Agents SDK to Copilot in VS Code, all the way to AWS guidance⁴. MCP is real, and adoption is strong.

    But it’s equally important to see what MCP doesn’t do, and that’s where the gaps show up. MCP’s focus is context exchange. It doesn’t care how your app or agent actually uses the context you pass in, or how you manage and shape that context per agent or task. It exposes the full tool surface, but doesn’t shape or filter it for quality or relevance. Per the architecture docs, MCP “focuses solely on the protocol for context exchange. It does not dictate how AI applications use LLMs or manage the provided context².” You get a discoverable catalog and schemas, but no built-in mechanism in the protocol to optimize how that context is presented.

    Note: Some SDKs now add optional filtering; for example, OpenAI’s Agents SDK supports static/dynamic MCP tool filtering⁵ (see the sketch below). It’s a step in the right direction, but still leaves too much on the table.
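
    For illustration, here is a minimal sketch of what that static filtering looks like. It assumes the openai-agents Python SDK with its MCP helpers (MCPServerStdio, create_static_tool_filter) and a hypothetical finance MCP server started over stdio; check the SDK docs for the exact, current signatures.

    # Sketch: static MCP tool filtering with the OpenAI Agents SDK (assumed API;
    # the "some-finance-mcp-server" package and the "get_quote" tool are hypothetical).
    import asyncio
    from agents import Agent, Runner
    from agents.mcp import MCPServerStdio, create_static_tool_filter

    async def main():
        async with MCPServerStdio(
            params={"command": "npx", "args": ["-y", "some-finance-mcp-server"]},
            # Only these tool names are ever shown to the model; everything else is hidden.
            tool_filter=create_static_tool_filter(allowed_tool_names=["get_quote"]),
        ) as finance_server:
            agent = Agent(
                name="Quote agent",
                instructions="Answer questions about the latest stock quote only.",
                mcp_servers=[finance_server],
            )
            result = await Runner.run(agent, "What is AAPL trading at right now?")
            print(result.final_output)

    if __name__ == "__main__":
        asyncio.run(main())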

    1. Anthropic MCP overview
    2. Model Context Protocol — Architecture 
    3. MCP Spec — Authorization
    4. OpenAI Agents SDK (MCP); VS Code MCP GA; AWS — Unlocking MCP 
    5. GitHub — PR #861 (MCP tool filtering)
    6. Medium — How many tools/functions can an AI Agent have?

    The Problem in Practice

    To illustrate this, let’s take the (unofficial) Yahoo Finance API. Like many APIs, it returns a large JSON object packed with dozens of metrics. Powerful for analysis, but overwhelming when your agent merely needs to retrieve one or two key figures. To illustrate my point, here’s a snippet of what the agent might receive when calling the API:

    yahooResponse = {
      "quoteResponse": {
        "result": [
          {
            "symbol": "AAPL",
            "regularMarketPrice": 172.19,
            "marketCap": ...,
            # … roughly 100 other fields
          }
        ]
      }
    }

    # Other fields: regularMarketChangePercent, currency, marketState, exchange,
    #   fiftyTwoWeekHigh/Low, trailingPE, forwardPE, earningsDate,
    #   incomeStatementHistory, financialData (with revenue, grossMargins, etc.),
    #   summaryProfile, etc.

    For an agent, getting 100 fields of data, among other tool output, is overwhelming: irrelevant data, bloated prompts, and wasted tokens. It’s obvious that accuracy goes down as tool counts and schema sizes grow; researchers have shown that as the toolset expands, retrieval and invocation reliability drops sharply¹, and inputting every tool into the LLM quickly becomes impractical due to context length and latency constraints². This obviously depends on the model, but as models grow more capable, tool demands are increasing as well. Even state-of-the-art models still struggle to effectively select tools in large tool libraries³.

    The problem is not limited to tool output. The more important problem is the API input schema. Going back to our example, for the Yahoo Finance API, you can request any combination of modules: assetProfile, financialData, price, earningsTrend, and many more. If you expose this schema to your agent raw, through MCP (or FastAPI, etc.), you’ve just massively polluted your agent context. At massive scale, this becomes even more challenging; recent work notes that LLMs operating on very large tool graphs require new approaches such as structured scoping or graph-based methods⁴.

    Tool definitions consume tokens in every conversation turn; empirical benchmarks show that large, multi-parameter tools and big toolsets quickly dominate your prompt budget⁵. Without a filtering or rewriting layer, the accuracy and efficiency of your AI agent degrade⁶.

    1. Re-Invoke: Tool Invocation Rewriting for Zero-Shot Tool Retrieval: “Identifying the most relevant tools … becomes a key bottleneck as the toolset size grows, hindering reliable tool utilization.”
    2. Towards Completeness-Oriented Tool Retrieval for LLMs: “…it is impractical to input all tools into LLMs due to length limitations and latency constraints.”
    3. Deciding Whether to Use Tools and Which to Use: “…the majority [of LLMs] still struggle to effectively select tools…”
    4. ToolNet: Connecting LLMs with Massive Tools via Tool Graph: “It remains challenging for LLMs to operate on a library of massive tools,” motivating graph-based scoping.
    5. How many tools/functions can an AI Agent have? (Feb 2025): reports that a tool with 28 params consumed 1,633 tokens; a set of 37 tools consumed 6,218 tokens.
    6. Benchmarking Tool Retrieval for LLMs (ToolRet): a large-scale benchmark showing tool retrieval is hard even for strong IR models.

    Here’s a sample tool definition if you expose the raw API without creating a custom tool for it (this has been shortened for readability):

    yahooFinanceTool = {
     "name": "yahoo.quote_summary",
     "parameters": {
        "type": "object",
        "properties": {
          "symbol": {"type": "string"},
          "modules": {
            "type": "array",
            "items": {"type": "string"},
            "description": "Select any of: assetProfile, financialData, price, 
              earningsHistory, incomeStatementHistory, balanceSheetHistory, 
              cashflowStatementHistory, summaryDetail, quoteType, 
              recommendationTrend, secFilings, fundOwnership, 
              … (and dozens more modules)"
          },
        # … plus more parameters: region, lang, overrides, filters, etc.
        },
      "required": ["symbol"]
      }
    }

    The Fix

    Here’s the real unlock: with tool masking, you’re in control of the surface you present to your agent. You aren’t forced to expose your entire API, and you don’t have to recode your integrations for every new use case.

    Want the agent to only ever fetch the latest stock quote? Build a mask that presents just that action as a simple tool.

    Need to support several distinct tasks, like fetching a quote, extracting only revenue, or maybe toggling between price types? You can design several narrow tools, each with its own mask on top of the same underlying tool handler.

    Or you might combine related actions into a single tool and give the agent an explicit toggle or enum, whatever interface fits the agent’s context and task.

    Wouldn’t it be nicer if the agent only saw very simple, purpose-built tools, like these?

    # Simple Tool: Get latest price and market cap
    fetchPriceAndCap = {
      "name": "get_price_and_marketcap",
      "parameters": {
        "type": "object",
        "properties": {
          "symbol": {"type": "string"}
        },
        "required": ["symbol"]
      }
    }

    or

    # Simple Tool 2: Get company revenue only
    fetchRevenue = {
      "name": "get_revenue",
      "parameters": {
        "type": "object",
        "properties": {
          "symbol": {"type": "string"}
        },
        "required": ["symbol"]
      }
    }

    The underlying code uses the same handler. No need to duplicate logic or force the agent to reason about the full module surface. Just different masks for different jobs: no module lists, no bloat, no recoding*. (A minimal sketch of this one-handler, many-masks setup follows the note below.)

     * This aligns with guidance to use only essential tools, minimize parameters, and, where possible, activate tools dynamically for a given interaction.
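
    Here is that sketch. The handler is a stubbed stand-in for the (unofficial) Yahoo Finance API, and its field names are illustrative; only the two thin wrappers would ever be registered as tools.

    # One broad handler, two narrow model-facing tools (stubbed illustration).
    def yahoo_quote_summary(symbol: str, modules: list[str]) -> dict:
        """Raw handler: full capability surface. The real version would call the
        (unofficial) Yahoo Finance API and return ~100 fields per module."""
        return {
            "quoteResponse": {
                "result": [{
                    "symbol": symbol,
                    "regularMarketPrice": 172.19,
                    "marketCap": 2_700_000_000_000,
                    "financialData": {"totalRevenue": 383_000_000_000},
                    # ... plus dozens of other fields in the real response
                }]
            }
        }

    def get_price_and_marketcap(symbol: str) -> dict:
        """Masked tool 1: symbol in, two fields out; the module list is fixed here."""
        quote = yahoo_quote_summary(symbol, modules=["price"])["quoteResponse"]["result"][0]
        return {"symbol": quote["symbol"],
                "market_price": quote["regularMarketPrice"],
                "market_cap": quote["marketCap"]}

    def get_revenue(symbol: str) -> dict:
        """Masked tool 2: same handler, a different narrow surface."""
        quote = yahoo_quote_summary(symbol, modules=["financialData"])["quoteResponse"]["result"][0]
        return {"symbol": quote["symbol"],
                "revenue": quote["financialData"]["totalRevenue"]}

    # Only get_price_and_marketcap and get_revenue are exposed to the LLM;
    # the module list and the other ~100 fields never enter the prompt.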

    The power of tool masking

    The point: tool masking is not just about hiding complexity. It’s about designing the right agent-facing surface for the job at hand.

    • You can expose an API as one tool or many.
    • You can tune what’s required, optional, or even fixed (hard-coded values).
    • You can present different masks to different agents, based on role, context, or business logic.
    • You can refactor the surface at any time, without rewriting the handler or backend code.

    This isn’t just technical hygiene; it’s a strategic design choice. It lets you ship cleaner, leaner, more robust agents that do exactly what’s needed, no more and no less.

    This is the power of tool masking:

    • Start with a broad, messy API surface
    • Define as many narrow masks as needed, one for each agent use case
    • Present only what matters (and nothing more) to the model

    The result? Smaller prompts, faster responses, fewer misfires, and agents that get it right, every time. Why does this matter so much, especially at enterprise scale?

    • Choice entropy: when the model is overloaded with options, it’s more likely to misfire or pick the wrong fields
    • Performance: extra tokens mean higher cost, more latency, lower performance, less accuracy, less consistency
    • Enterprise scale: when you’re sending millions of tokens per minute, small inefficiencies quickly add up. Precision matters. Fault tolerance is lower. (Large tool outputs can also echo through histories and balloon spend)¹

    1. Everything Wrong with MCP

    The Solution

    At the heart of robust tool masking is a clean separation of concerns.

    First, you have the tool handler: this is the raw integration, whether it’s a third-party API, internal service, or direct function call. The handler’s job is simply to expose the full capability surface, with all its power and complexity.

    Next comes the tool mask. The mask defines the model-facing interface: a narrow schema, tailored input and output, and sensible defaults for the agent’s use case or role. This is where the broad, messy surface of the underlying tool is slimmed down to exactly what’s needed (and nothing more).

    In between sits the tooling service. This is the mediator that applies the mask, validates the input, translates agent requests into handler calls, and validates or sanitizes responses before returning them to the model.

    High-Level Overview: Tool Mask (figure)
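
    One way to picture this separation in code, with illustrative names rather than a prescribed API:

    # Sketch of the separation of concerns: handler, mask, and tooling service.
    from dataclasses import dataclass
    from typing import Any, Callable

    Handler = Callable[..., Any]          # raw integration: API client, internal service, function

    @dataclass
    class ToolMask:
        tool_name: str                    # model-facing name
        description: str                  # model-facing description
        input_schema: dict                # narrow JSON schema shown to the LLM
        handler_name: str                 # which raw integration to call
        handler_input_template: dict      # how agent input maps onto the handler call
        output_template: str              # how the raw response is shaped for the model

    class ToolingService:
        """Mediator: applies a mask, validates input, calls the handler, shapes output."""
        def __init__(self, handlers: dict[str, Handler], masks: dict[str, ToolMask]):
            self.handlers = handlers
            self.masks = masks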

    Ideally, you store and manage tool masks in the same place where you store all your other agent/system prompts, because, in practice, presenting a tool to an LLM is a form of prompt engineering.

    Let’s review an example of an actual tool mask. Our definition of a tool mask has evolved over the past couple of years¹, starting as a simple filter and growing into a full enterprise service used by the largest tech companies in the world.

    1. Initially (in 2023), we started with simple input/output adapters, but as we worked across multiple companies and many use cases, it has evolved into a full prompt engineering surface.

    Tool mask example

    tool_name: stock_price
    description: Retrieve the latest market price for a stock symbol via Yahoo Finance.
    
    handler_name: yahoo_api
    
    handler_input_template:
      session_id: "{{ context.session_id }}"
      symbol: "{{ input.symbol }}"
      modules:
        - price
    
    output_template: |
      {
        "data": {
          "symbol": "{{ result.quoteResponse.result[0].symbol }}",
          "market_price": "{{ result.quoteResponse.result[0].regularMarketPrice }}",
          "currency": "{{ result.quoteResponse.result[0].currency }}"
        }
      }
    
    input_schema:
      type: object
      properties:
        symbol:
          type: string
          description: "The stock ticker symbol (e.g., AAPL, MSFT)"
      required: ["symbol"]
    
    custom_validation_template: |
      {% set symbol_str = input.symbol | string %}
      {% if symbol_str | length > 6 or symbol_str != symbol_str.upper() %}
          { "success": false, "error": "Symbol must be 1–6 uppercase letters." }
      {% endif %}
    

    The example above should speak for itself, but let’s highlight a few characteristics:

    • The mask translates the input (provided by the AI agent) into a handler_input (what the API will receive)
    • The handler for this particular tool is an API; it could just as well have been any other service. The service can have other masks on top of it, which pull different data out of the same API
    • The mask allows for Jinja*. This enables powerful prompt engineering
    • A custom validation is very powerful if you want to add specific nudges that steer the AI agent to self-correct its mistakes
    • The session_id and the module are hard-coded into the template. The AI agent is not able to modify these

    *Note: if you’re doing this in a Node.js environment, EJS works well for this too.

    With this architecture, you can flexibly add, remove, or modify tool masks without ever touching the underlying handler or agent code. Tool masking becomes a “configurable prompt engineering” layer, supporting rapid iteration, testing, and robust, role- or use-case-specific agent behavior. A minimal sketch of the mediating tooling service is shown below.
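
    The snippet loads a mask like the YAML above and applies it with Jinja2 and PyYAML. It assumes the field names from the example (handler_name, handler_input_template, output_template, custom_validation_template); a production service would add schema validation, error handling, and logging.

    # Sketch: a tooling service applying a YAML tool mask (field names as defined above).
    import json
    import yaml
    from jinja2 import Template

    def _render(value, **ctx):
        """Recursively render Jinja expressions inside the mask's input template."""
        if isinstance(value, str):
            return Template(value).render(**ctx)
        if isinstance(value, dict):
            return {k: _render(v, **ctx) for k, v in value.items()}
        if isinstance(value, list):
            return [_render(v, **ctx) for v in value]
        return value

    def call_masked_tool(mask_yaml: str, handlers: dict, agent_input: dict, context: dict):
        mask = yaml.safe_load(mask_yaml)

        # 1. Custom validation: if the template renders to anything, treat it as an
        #    error payload and return it so the agent can self-correct.
        validation = mask.get("custom_validation_template")
        if validation:
            rendered = Template(validation).render(input=agent_input, context=context).strip()
            if rendered:
                return json.loads(rendered)

        # 2. Translate the narrow agent input into the full handler call.
        #    Hard-coded values (session_id, modules) come from the mask, not the LLM.
        handler_args = _render(mask["handler_input_template"], input=agent_input, context=context)
        result = handlers[mask["handler_name"]](**handler_args)

        # 3. Shape the raw response into the small surface the agent actually sees.
        shaped = Template(mask["output_template"]).render(
            result=result, input=agent_input, context=context)
        return json.loads(shaped)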

    Hey, it’s almost as if a tool has become a prompt…

    The Overlooked Prompt Engineering Surface

    Tools are prompts. It’s interesting that in today’s AI blogs, there’s little mention of this. An LLM receives text and then generates text. Tool names, tool descriptions, and their input schemas are part of the incoming text. Tools are prompts, just with a special flavor.

    When your code makes an LLM call, the model reads the full prompt input, and then decides whether and how to call a tool¹²³. If we conclude that tools are essentially prompts, then I hope that as you’re reading this you’re having the following realization:

    Tools need to be prompt engineered, and thus any prompt engineering technique I have at my disposal also needs to be applied to my tooling:

    • Tools are context dependent! Tool descriptions should fit with the rest of the prompt context.
    • Tool naming matters, a lot!
    • The tool input surface adds tokens and complexity, and thus needs to be optimized.
    • The same goes for the tool output surface.
    • The framing and phrasing of tool error responses matters; an agent will self-correct if you show it the right response.

    In practice, I see many examples where engineers provide extensive instructions regarding a particular tool’s use in the main prompt of the agent. This is a practice we should question. Should the instructions on how to use a tool live in the larger agent prompt, or with the tool? Some tools need only a short summary; others benefit from richer guidance, examples, or edge-case notes so the model selects them reliably and formats arguments correctly. With masking, you can adapt the same underlying API to different agents and contexts by tailoring the tool description and schema per mask. Keeping that guidance co-located with the tool surface stabilizes the contract and avoids drifting chat prompts (see Anthropic’s Tool use and Best practices for tool definitions). When you also specify output structure, you improve consistency and parse-ability¹. Masks make this editable by prompt engineers instead of burying it in (Python) code.

    Operationally, we should treat masks as configurable prompts for tools. Practically, we recommend that you store the masks in the same layer that hosts your prompts. Ideally, this is a config system that supports templating (e.g., Jinja), variables, and evaluation. These concepts are just as applicable to tool masks as to your regular prompts. Furthermore, we recommend that you version them, scope them by agent or role, and use these tool masks to fix defaults, hide unused params, or split one broad handler into several clean surfaces. Tool masks also have security benefits, allowing specific params to be supplied by the system instead of the LLM. (Independent critiques also highlight cost/security risks from unbounded tool outputs; yet another reason to constrain surfaces⁴.) A small registry sketch of how masks can be scoped per agent follows below.
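
    For instance, a minimal registry keyed by agent role (file paths, version suffixes, and role names are purely illustrative) lets the same yahoo_api handler show up under different masks for different agents:

    # Illustrative sketch: masks versioned and scoped per agent role, stored with the prompts.
    from pathlib import Path

    MASK_DIR = Path("prompts/tool_masks")   # hypothetical location, next to the other prompts

    # (agent_role, tool_name) -> mask file; the same yahoo_api handler backs all of them.
    MASK_REGISTRY = {
        ("research_agent", "stock_price"): MASK_DIR / "stock_price.v2.yaml",
        ("research_agent", "get_revenue"): MASK_DIR / "get_revenue.v1.yaml",
        ("support_agent", "stock_price"):  MASK_DIR / "stock_price_basic.v1.yaml",
    }

    def masks_for(role: str) -> dict[str, str]:
        """Return the model-facing tool masks (YAML text) for a given agent role."""
        return {
            tool_name: path.read_text()
            for (mask_role, tool_name), path in MASK_REGISTRY.items()
            if mask_role == role
        }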

    Done well, masking extends prompt engineering to the tool boundary where the model actually acts, yielding cleaner behavior and more consistent execution.

    1. Anthropic — Tool Use Overview
    2. OpenAI — Tools Guide
    3. OpenAI Cookbook — Prompting Guide
    4. Everything Wrong with MCP

    Design Patterns

    A handful of simple patterns cover most masking needs. Start with the smallest surface that works, then expand only when a task actually demands it. A combined example follows the list below.

    • Schema Shrink: Limit parameters to what the task needs; constrain types and ranges; prefill invariants.
    • Role-Scoped View: Present different masks to different agents or contexts; same handler, tailored surfaces.
    • Capability Gate: Expose a focused subset of operations; split a mega-tool into single-purpose tools; enforce allowlists.
    • Defaulted Args: Set good defaults and hide nonessential options to cut tokens and variance.
    • System-Supplied Args: Inject tenant, account, region, or policy values from the system; the LLM cannot change them, which improves security and consistency.
    • Toggle/Enum Surface: Combine related actions into one tool with an explicit enum or mode; no free-text switches.
    • Typed Outputs: Return a small, strict schema; normalize units and keys for reliable parsing and evaluation.
    • Progressive Disclosure: Ship the minimal mask first; add optional fields via new mask versions only when needed.
    • Validation: Allow custom input validation at the tool mask level; set constructive validation responses to guide the agent in the right direction
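
    To illustrate a few of these patterns together (Capability Gate, Toggle/Enum Surface, System-Supplied Args, Typed Outputs), here is a hypothetical mask in the same style as the earlier example; the mode enum, the injected session_id, and the field names are illustrative, not part of any official schema.

    # Hypothetical mask combining several patterns from the list above.
    stock_lookup_mask = """
    tool_name: stock_lookup
    description: Look up either the latest price or trailing revenue for a stock symbol.

    handler_name: yahoo_api

    handler_input_template:
      session_id: "{{ context.session_id }}"    # system-supplied, never model-editable
      symbol: "{{ input.symbol }}"
      modules:
        - "{{ 'price' if input.mode == 'price' else 'financialData' }}"

    input_schema:
      type: object
      properties:
        symbol:
          type: string
          description: "Stock ticker symbol (e.g., AAPL, MSFT)"
        mode:
          type: string
          enum: ["price", "revenue"]
          description: "Which figure to return"
      required: ["symbol", "mode"]

    output_template: |
      {
        "data": {
          "symbol": "{{ result.quoteResponse.result[0].symbol }}",
          "value": "{{ result.quoteResponse.result[0].regularMarketPrice
                       if input.mode == 'price'
                       else result.quoteResponse.result[0].financialData.totalRevenue }}"
        }
      }
    """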

    Conclusion

    Connectivity solved the what. Execution is the how. Services like MCP connect tools. Tool masking makes them perform by shaping the model-facing surface to fit the task and the agent that are working with it.

    Think use case down, not tech up. One handler, many masks. Narrow your inputs and outputs, and experiment and prompt engineer your tool surface to perfection. Keep the description with the tool, not buried in chat text or code. Treat masks as configurable prompts that you can version, test, and assign per agent.

    If you expose raw surfaces, you pay for entropy: more tokens, higher latency, lower accuracy, inconsistent behavior. Masks flip that curve. Smaller prompts. Faster responses. Higher pass rates. Fewer misfires. The impact of this approach compounds at enterprise scale. (Even MCP advocates note that discovery lists everything, without curation, and that agents send/consider too much data.)

    So, what to do?

    • Put a masking layer between agents and every broad API
    • Try multiple masks on one handler, and customize a mask to see how it impacts performance
    • Store masks together with your prompts in config; version and iterate
    • Move tool instructions into the tool surface, and out of system prompts
    • Provide sensible defaults, and hide what the model should not touch

    Stop shipping mega tools. Ship surfaces. That’s the layer MCP forgot. The step that turns an agent from connected into enabled.

    Drop us a comment on LinkedIn if you liked this article!


    About the authors:
    Lucas and Frank have worked closely together on AI infrastructure across multiple companies (and advised a handful of others), from some of the earliest multi-agent teams, to LLM provider management, to document processing, to enterprise AI automation. We work at Databook, an innovative AI automation platform for the world’s largest tech companies (MSFT, AWS, Databricks, Salesforce, and others), which we empower with a range of solutions using passive/proactive/guided AI for real-world, enterprise production applications.


