Inspired by the ICLR 2026 blogpost/article, The 99% Success Paradox: When Near-Perfect Retrieval Equals Random Selection
As an Edinburgh-trained PhD in Information Retrieval from Victor Lavrenko’s Multimedia Information Retrieval Lab, where I studied in the late 2000s, I have long seen retrieval through the framework of traditional IR thinking:
- Did we retrieve at least one relevant chunk?
- Did recall go up?
- Did the ranker improve?
- Did downstream answer quality look acceptable on a benchmark?
These are still useful questions. But after reading the recent work on Bits over Random (BoR), I think they are incomplete for the agentic systems many of us are now actually building.
The ICLR blogpost sharpened something I had felt for a while in production LLM systems: retrieval quality should account for both how much good content we find and how much irrelevant material we bring along with it. In other words, as we crank up recall we also increase the risk of context pollution.
What makes BoR useful is that it gives us a language for this. BoR tells us whether retrieval is genuinely selective, or whether we are achieving success mostly by stuffing the context window with more material. When BoR falls, it is a sign that the retrieved bundle is becoming less discriminative relative to chance. In practice, that often correlates with the model being forced to read more junk, more overlap, or more weakly relevant material.
The important nuance is that BoR does not directly measure what the model “feels” when reading a prompt. It measures retrieval selectivity relative to random chance. But lower selectivity typically goes hand in hand with more irrelevant context, more prompt pollution, more attention dilution, and worse downstream performance. Put simply, BoR helps tell us when retrieval is still selective and when it has started to degenerate into context stuffing.
That idea matters much more for RAG and agents than it did for classic search.
Why retrieval dashboards can mislead agent teams
One of the easiest traps in RAG is to look at your retrieval dashboard, see healthy metrics, and conclude that the system is doing well. You might see:
- high Success@K,
- strong recall,
- healthy ranking metrics,
- and a larger K seeming to improve coverage.
On paper things may look better but, in reality, the agent might actually behave worse. Your agent may have any number of maladies such as diffuse answers to queries, unreliable tool use, or simply a rise in latency and token cost without any real user benefit.
This disconnect happens because most retrieval dashboards still reflect a human search worldview. They assume the consumer of the retrieved set can skim, filter, and ignore junk. Humans are surprisingly good at this. LLMs are not consistently good at it.
An LLM does not “skim” ten retrieved items and casually focus on the best two in the way a strong analyst would. It processes the entire bundle as prompt context. That means the retrieval layer is surfacing evidence that is actively shaping the model’s working memory.
This is why I think agent teams should stop treating retrieval as a back-office ranking problem and start treating it as a reasoning-budget allocation problem. When building performant agentic systems, the key question is both:
- Did we retrieve something relevant?
and:
- How much noise did we force the model to process in order to get that relevance?
That is the lens BoR pushes you toward, and I have found it to be a very useful one.
Context engineering is becoming a first-class discipline
One reason this paper has resonated with me is that it fits a broader shift already happening in practice. Software engineers and ML practitioners working on LLM systems are steadily becoming something closer to context engineers.
That means designing systems that decide:
- what should enter the prompt,
- when it should enter,
- in what form,
- at what granularity,
- and what should be excluded entirely.
In traditional software, we worry about memory, compute, and API boundaries. In LLM systems, we also need to worry about context purity. The context window is contested cognitive real estate.
Every irrelevant passage, duplicated chunk, weakly related example, verbose tool definition, and poorly timed retrieval result competes with the thing the model most needs to focus on. That is why I like the pollution metaphor. Irrelevant context contaminates the model’s workspace.
The BoR poster gives this intuition a more rigorous shape by telling us that we should stop evaluating retrieval solely by whether it succeeds. We should also ask how much better the retrieval is compared to chance, at the depth (top-K retrieved items) that we are actually using. That is a very practitioner-friendly question.
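The poster’s exact formulation lives in the original work, but the name suggests a natural reading: measure how many bits of improvement retrieval delivers over a random baseline at the same depth K. Below is a minimal sketch under that assumption; the function names and the log-ratio form are my own illustration, not the paper’s definition. The random baseline is the hypergeometric probability that a uniformly random set of K items contains at least one relevant item.

```python
import math

def random_success_at_k(n_items: int, n_relevant: int, k: int) -> float:
    """P(a uniformly random set of k items contains >= 1 relevant item).

    Complement of the hypergeometric zero-hit probability
    C(n_items - n_relevant, k) / C(n_items, k).
    """
    if k >= n_items - n_relevant + 1:
        return 1.0  # too few irrelevant items left to fill the set: guaranteed hit
    return 1.0 - math.comb(n_items - n_relevant, k) / math.comb(n_items, k)

def bits_over_random(observed_success: float, n_items: int, n_relevant: int, k: int) -> float:
    """Hypothetical BoR-style score: log2 ratio of observed Success@K to the
    random baseline at the same depth. 0 bits means no better than chance."""
    baseline = random_success_at_k(n_items, n_relevant, k)
    return math.log2(observed_success / baseline)
```

For example, on a corpus of 1,000 chunks with 5 relevant ones, a system with Success@10 of 0.9 scores roughly 4.2 bits over the roughly 0.05 random baseline. The same 0.9 success at K = 500 scores slightly below zero, because the random baseline has climbed to about 0.97: at that depth, the retrieval looks successful but is no longer selective.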
Why tool overload breaks agents
This is where I think the BoR work becomes especially important for real-world agent systems.
In classic RAG, the corpus is usually large. You may be retrieving from tens of thousands or millions of chunks. In that regime, random chance stays weak for longer. Tool selection is very different.
In an agent, the model may be choosing among 20, 50, or 100 tools. That sounds manageable until you realize that several tools are often vaguely plausible for the same task. Once that happens, dumping all tools into context is not thoroughness. It is confusion disguised as completeness.
I have seen this pattern repeatedly in agent design:
- the team adds more tools,
- descriptions become longer,
- overlap between tools increases,
- the agent starts making brittle or inconsistent choices,
- and the first instinct is to tune the prompt harder.
But often the real issue is architectural, not prompt-level. The model is being asked to choose from an overloaded context where distinctions are too weak and too numerous.
What BoR offers here is a useful way to formalize something people often feel only intuitively: there is a point where the selection task becomes so crowded that the model is no longer demonstrating meaningful selectivity.
That is why I strongly prefer agent designs with:
- Staged tool retrieval: narrowing the search in steps, first finding a small set of plausible tools, then making the final choice from that shortlist rather than from the full library at once.
- Domain routing: before final tool choice, first deciding which broad area the task belongs to, such as search, CRM, finance, or coding, and only then picking a specific tool within that domain.
- Compressed capability summaries: presenting each tool with a short, high-signal description of what it is for, when it should be used, and how it differs from nearby tools, instead of dumping long verbose specs into the prompt.
- Explicit exclusion of irrelevant tools: deliberately removing tools that are not appropriate for the current task so the model is not distracted by plausible but unnecessary options.
In my experience, tool choice should be treated more like retrieval than like static prompt decoration.
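To make the staged design concrete, here is a minimal sketch of domain routing followed by shortlist selection. The tools, the keyword scorer, and the function names are all hypothetical stand-ins; a production system would typically use embeddings or a trained classifier for both stages.

```python
from dataclasses import dataclass

@dataclass
class Tool:
    name: str
    domain: str   # broad area used for routing, e.g. "crm", "finance"
    summary: str  # short, high-signal capability summary

TOOLS = [
    Tool("crm_lookup", "crm", "fetch a customer record by email or id"),
    Tool("crm_update", "crm", "update fields on an existing customer record"),
    Tool("invoice_search", "finance", "find invoices by customer or date range"),
    Tool("repo_grep", "coding", "search source code for a string or pattern"),
]

def route_domain(task: str) -> str:
    """Stage 1: decide the broad domain before any individual tool is considered."""
    domain_keywords = {"crm": {"customer", "contact"},
                       "finance": {"invoice", "payment"},
                       "coding": {"code", "bug", "function"}}
    words = set(task.lower().split())
    return max(domain_keywords, key=lambda d: len(domain_keywords[d] & words))

def shortlist(task: str, k: int = 2) -> list[Tool]:
    """Stage 2: rank only the tools inside the routed domain, keep a tiny K."""
    domain = route_domain(task)
    candidates = [t for t in TOOLS if t.domain == domain]
    words = set(task.lower().split())
    candidates.sort(key=lambda t: len(set(t.summary.split()) & words), reverse=True)
    return candidates[:k]  # only this shortlist ever reaches the model's prompt
```

The point of the design is that the final choice is always made from a handful of in-domain candidates, never from the full library, so irrelevant tools are excluded by construction rather than left for the model to ignore.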
Understanding BoR through tool selection
One of the most useful things about BoR is that it sharpens what top-K really means in tool-using agents.
In document retrieval, increasing top-K often means moving from top-5 passages to top-20 or top-50 from a very large corpus. In tool selection, the same move has a very different character. When an agent only has a modest tool library, increasing top-K may mean moving from a shortlist of 3 candidate tools, to 5, to 8, and eventually to the familiar but dangerous fallback: just give it all 15 tools to be safe.
That often improves recall or Success@K, because the right tool is more likely to be somewhere in the visible set. But that improvement can be misleading. As K grows, you are not only helping the router. You are also making it easier for a random selector to include a relevant tool.
So the real question is not merely: Did top-8 contain a useful tool more often than top-3? The more important question is: Did top-8 improve meaningful selectivity, or did it mostly make the task easier through brute-force inclusion? That is exactly where BoR becomes useful.
A simple example makes the intuition clearer. Suppose you have 10 tools, and for a given class of task 2 of them are genuinely relevant. If you show the model just one tool, the random chance of surfacing a relevant one is 20 percent. At 3 tools, the random baseline rises sharply. At 5 tools, random inclusion is already fairly strong. At 10 tools, it is 100 percent, because you have shown everything. So yes, Success@K rises as K rises. But the meaning of that success changes. At low K, success indicates real discrimination. At high K, success may simply mean you included enough of the menu that failure became difficult.
That is what I mean by helping random chance rather than meaningful selectivity.
This matters because, with tools, the problem is worse than a misleading metric. When you show too many tools, the prompt gets longer, descriptions begin to overlap, the model sees more near-matches, distinctions become fuzzier, parameter confusion rises, and the chance of picking a plausible-but-wrong tool increases. So even though top-K recall improves, the quality of the final decision may get worse. This is the small-tool paradox: adding more candidate tools can improve apparent coverage while reducing the agent’s ability to choose cleanly.
A practical way to think about this is that tool selection often falls into three regimes:
- The healthy regime: K is small relative to the number of tools, and the appearance of a relevant tool in the shortlist tells you the router actually did something useful. For example, 30 total tools, 2 or 3 relevant, and a shortlist of 3 or 4 still feels like genuine selection.
- The grey zone: K is large enough that recall improves, but random inclusion is also rising quickly. For example, 20 tools, 3 relevant, a shortlist of 8. Here you may still gain something, but you should already be asking whether you are really routing or merely widening the funnel.
- The collapse regime: K is so large that success mostly comes from exposing enough of the tool menu that random selection would also succeed often. If you have 15 tools, 3 relevant ones, and a shortlist of 12 or all 15, then “high recall” is no longer saying much. You are getting close to brute-force exposure.
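These regime boundaries can be put on rough numerical footing. The snippet below computes the random-chance Success@K baseline, the probability that a uniformly random shortlist of K tools contains at least one relevant tool, for the three examples above:

```python
import math

def random_baseline(n_tools: int, n_relevant: int, k: int) -> float:
    """P(a uniformly random shortlist of k tools contains >= 1 relevant tool)."""
    return 1.0 - math.comb(n_tools - n_relevant, k) / math.comb(n_tools, k)

# (total tools, relevant tools, shortlist size) mirroring the three regimes
for label, n, r, k in [("healthy", 30, 3, 4),
                       ("grey zone", 20, 3, 8),
                       ("collapse", 15, 3, 12)]:
    print(f"{label:9s}  n={n:2d}  r={r}  k={k:2d}  "
          f"random Success@K = {random_baseline(n, r, k):.3f}")
```

The healthy example leaves random chance at about 36 percent, so a router that reliably surfaces a relevant tool is doing real work. The grey-zone example already gives random chance about 81 percent, and the collapse example reaches 99.8 percent, where “high recall” is close to guaranteed regardless of routing quality.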
Operationally, this pushes me toward a better question. In a small-tool system, I recommend avoiding the overexposure mindset that asks:
- How big must K be before recall looks good?
The better question is:
- How small can my shortlist be while still preserving strong task performance?
That mindset encourages disciplined routing.
In practice, that usually means routing first and choosing second, keeping the shortlist very small, compressing tool descriptions so distinctions are obvious, splitting tools into domains before final selection, and testing whether increasing K improves end-to-end task accuracy, not just tool recall. A useful sanity check is this: if giving the model all tools performs about the same as your routed shortlist, then your routing layer is not adding much value. And if giving the model more tools improves recall but worsens overall task performance, you are likely in exactly the regime where K is helping random chance more than real selectivity.
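That sanity check is easy to automate. Here is a sketch of the shape of the experiment; `run_task` stands in for however you score a single agent episode in your own harness (returning 1 for success, 0 for failure), and is an assumption, not a real API.

```python
def compare_routing(tasks, full_library, router, run_task):
    """Run each task twice: once with the full tool library visible, once
    with only the routed shortlist. If the two accuracies match, the router
    is adding little value; if the shortlist wins, the full library was
    imposing a context-pollution tax."""
    full_correct = shortlist_correct = 0
    for task in tasks:
        full_correct += run_task(task, full_library)       # every tool visible
        shortlist_correct += run_task(task, router(task))  # routed shortlist only
    n = len(tasks)
    return full_correct / n, shortlist_correct / n
```

Running this across a representative task set, and repeating it as K varies, tells you directly whether a bigger shortlist is buying end-to-end accuracy or merely recall.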
When the failure mode changes: large tool libraries
The large-tool case is different, and this is where an important nuance matters. A larger tool universe does not mean we should dump hundreds of tools into context and expect the system to work better. It just means the failure mode changes.
If an agent has 1,000 tools available and only a handful are relevant, then increasing top-K from 10 to 50 or even 100 may still represent meaningful selectivity. Random chance stays weaker for longer than it does in the small-tool case. In that sense, BoR is still useful: it helps stop us from mistaking broader exposure for better routing. It asks whether a larger shortlist reflects genuine selectivity, or whether it is merely helping by exposing a larger slice of the search space.
But BoR does not capture the whole problem here. With very large tool libraries, the issue may no longer be that random chance has become too strong. The issue may be that the model is simply drowning in options. A shortlist of 200 tools can still be better than random in BoR terms and yet still be a terrible prompt. Tool descriptions overlap, near-matches proliferate, distinctions become harder to maintain, and the model is forced to reason over a crowded semantic menu.
So BoR is valuable, but it is not sufficient on its own. It is better at telling us whether a shortlist is genuinely discriminative relative to chance than whether that shortlist is still cognitively manageable for the model. In large tool libraries, we therefore need both perspectives: BoR to measure selectivity, and downstream measures such as tool-choice quality, latency, parameter correctness, and end-to-end task success to measure usability.
Again, the nuance is that BoR measures selectivity relative to random chance, not what the model “feels” when reading a prompt. But low BoR is often a warning sign that the model is being asked to process an increasingly noisy context window.
The design implication is the same even though the reason differs. With small tool sets, broad exposure quickly becomes bad because it helps random chance too much. With very large tool sets, broad exposure becomes bad because it overwhelms the model. In both cases, the answer is not to stuff more into context. It is to design better routing.
My own rule of thumb: the model should see less, but cleaner
If I had to summarize the practical shift in one sentence, it would be this: for LLM systems, smaller and cleaner is often better than larger and more comprehensive.
That sounds obvious, but many systems are still designed as if “more context” is automatically safer. In reality, once a baseline level of useful evidence is present, additional retrieval can become harmful. It increases token cost and latency, but more importantly it widens the field of competing cues inside the prompt.
I have come to think about prompt construction in three layers:
Layer 1: mandatory task context
- The core instruction, constraints, and immediate user objective.
Layer 2: highly selective grounding
- Only the minimum supporting evidence or tool definitions needed for the next reasoning step.
Layer 3: optional overflow
- Material that is merely plausible, loosely related, or included “just in case.”
Most failures come from letting Layer 3 invade Layer 2. That is why retrieval should be judged not just by coverage, but by its ability to preserve a clean Layer 2.
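One way to keep Layer 3 from invading Layer 2 is to enforce the layering mechanically when the prompt is assembled. A toy sketch, with the function name and the character budget as illustrative assumptions:

```python
def build_prompt(task: str, grounding: list[str], overflow: list[str],
                 budget_chars: int = 2000) -> str:
    """Assemble a prompt in layers: mandatory task context first, selective
    grounding second, and 'just in case' material last -- and only if budget
    remains, so overflow is always the first thing dropped."""
    parts = [task]                  # Layer 1: mandatory task context
    parts.extend(grounding)         # Layer 2: highly selective grounding
    used = sum(len(p) for p in parts)
    for extra in overflow:          # Layer 3: optional overflow, priority order
        if used + len(extra) > budget_chars:
            break                   # protect Layers 1 and 2: drop overflow first
        parts.append(extra)
        used += len(extra)
    return "\n\n".join(parts)
```

The design choice is that the budget is only ever charged against Layer 3: mandatory context and selective grounding always survive, and merely plausible material is the first casualty of any squeeze.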
Where I think BoR is especially useful
I do not see BoR as a replacement for all retrieval metrics. I see it as a very useful additional lens, especially in these cases:
1. Choosing K in production
- Many teams still increase top-K until recall looks good enough. BoR encourages a more disciplined question: at what point is increasing K mostly helping random chance rather than meaningful selectivity?
2. Evaluating agent tool routing
- This may be the most compelling use-case. Agents often fail not because no good tool exists, but because too many nearly relevant tools are presented simultaneously.
3. Diagnosing why downstream quality falls despite “better retrieval”
- This is the classic paradox. Coverage goes up. Final answer quality goes down. BoR helps explain why.
4. Comparing systems with different retrieval depths
- Raw success rates can be deceptive when one system retrieves far more material than another. BoR helps normalize for that.
5. Preventing overconfidence in benchmark results
- Some benchmarks may simply be too easy at the chosen retrieval depth. A strong-looking result may be closer to luck than we think.
Where I think BoR may be insufficient on its own
I like the paper, but I would not treat BoR as the final answer to retrieval evaluation. There are at least a few important caveats.
First, not every task only needs one good item. Some tasks genuinely require synthesis across multiple pieces of evidence. In those cases, a success-style view can understate the need for broader retrieval.
Second, retrieval usefulness is not binary. Two chunks may both count as “relevant,” while one is far more actionable, concise, or decision-useful for the model.
Third, prompt organization still matters. A noisy bundle that is carefully structured may perform better than a slightly cleaner bundle that is poorly ordered or badly formatted.
Fourth, the model itself matters. Different LLMs have different tolerance for clutter, different long-context behavior, and different tool-use reliability. A retrieval policy that pollutes one model may be acceptable for another.
Fifth, and this is especially relevant for large tool libraries, BoR tells us more about selectivity than about usability. A shortlist can still look meaningfully better than random and yet be too crowded, too overlapping, or too semantically messy for the model to use well.
So I would not use BoR in isolation. I would pair it with:
- downstream task accuracy,
- latency and token-cost analysis,
- tool-call quality,
- parameter correctness,
- and some explicit measure of prompt cleanliness or redundancy.
Still, even with those caveats, BoR contributes something important: it forces us to stop confusing coverage with selectivity.
How this changes evaluation practice for me
The biggest practical shift is that I would now evaluate retrieval systems more like this:
- First, look at standard retrieval metrics. They still matter. Ideally, take a bag-of-metrics approach that combines multiple complementary signals.
Then ask:
- What is the random baseline at this depth?
- Is higher Success@K actually demonstrating skill, or just easier conditions?
- How much extra context did we add to get that gain?
- Did downstream answer quality improve, stay flat, or worsen?
- Are we making the model reason, or merely making it read more?
For agents, I would go even further:
- How many tools were visible at decision time?
- How much overlap existed between candidate tools?
- Could the system have routed first and selected second?
- Was the model asked to choose from a clean shortlist, or from a crowded menu?
That is a more realistic evaluation setup for the kinds of systems many teams are actually deploying.
The broader lesson
The main lesson I took from the ICLR poster is much broader than a single new metric: it’s that LLM system quality depends heavily on the cleanliness of the context we construct around the model. That has consequences across the Agentic stack:
- retrieval,
- memory,
- tool routing,
- agent planning,
- multi-step workflows,
- and even UI design for human-in-the-loop systems.
The best LLM systems will be the ones that expose the right information, at the right moment, in the smallest clean bundle that still supports the task. That is what good context engineering looks like.
Final thought
For years, retrieval was mostly about finding needles in haystacks. For LLM systems, that is no longer enough. Now the job is also to avoid dragging half the haystack into the prompt along with the needle.
That is why I think the BoR idea matters and is so impactful. It gives practitioners a better language for a real production problem: how to measure when useful context has quietly turned into polluted context. And once you start looking at your systems that way, a lot of familiar agent failures begin to make much more sense.
BoR does not directly measure what the model “feels” when reading a prompt, but it does tell us when retrieval is ceasing to be meaningfully selective and starting to resemble brute-force context stuffing. In practice, that is often exactly the regime where LLMs begin to read more junk, reason less cleanly, and perform worse downstream.
More broadly, I think this points to an important emerging sub-field: developing better metrics for measuring LLM system performance in realistic settings, not just model capability in isolation. We have become reasonably good at measuring accuracy, recall, and benchmark performance, but much less good at measuring what happens when a model is forced to reason through cluttered, overlapping, or weakly filtered context.
That, to me, exposes a real gap. BoR helps measure selectivity relative to chance, which is valuable. But there is still a missing concept around what I would term cognitive overload: the point at which a model may still have the right information somewhere in view, yet performs worse because too many competing options, snippets, tools, or cues are presented at once. In other words, the failure is no longer just retrieval failure. It is a reasoning failure induced by prompt pollution.
I suspect that better ways of measuring this kind of cognitive overload will become increasingly important as agentic systems grow more complex. The next leap forward may not just come from larger models or bigger context windows, but from better ways of quantifying when the model’s working context has crossed the line from useful breadth into harmful overload.
Disclaimer: The views and opinions expressed in this article are solely my own and do not represent those of my employer or any affiliated organisations. The content is based on personal reflections and speculative thinking about the future of science and technology. It should not be interpreted as professional, academic, or investment advice. These forward-looking perspectives are intended to spark discussion and imagination, not to make predictions with certainty.
