Multi-Agent Arena: Insights from London Great Agent Hack 2025

People are going to use more and more AI. Acceleration is going to be the path forward for computing. These fundamental trends, I completely believe in them.

Jensen Huang, NVIDIA CEO

I had the wonderful opportunity to take part in the Great Agent Hack 2025, hosted by Holistic AI at UCL [2, 3]. The hackathon was structured around three big challenges: Agent Iron Man, Agent Glass Box, and Dear Grandma, each representing a different philosophy of agentic AI. These weren’t just creative names for convenient categories; they reflected three pillars of how we think about agents today: robustness, transparency, and user safety (of anyone, including your grandma 😄). Being immersed in that environment for a weekend was a kind of reset button for me: it was energising, it reminded me why I enjoy working in this field, and it left me genuinely inspired to keep learning and building, even when there’s never enough time to explore everything that’s happening around AI.

In this hackathon, more than 50 projects were developed across three tracks. This article focuses on key moments from the event and a handful of projects that stood out to me personally, while recognising that every team contributed something valuable to the broader conversation on building robust and trustworthy agents. For readers who want to explore the full range of ideas, the complete gallery of 51 submissions is available here: https://hai-great-agent-hack-2025.devpost.com/project-gallery?page=1 [4].

**Figure 1.** Official leaflet and my T-shirt from the Great Agent Hack 2025. Image by the author.

The event was hosted at the UCL Centre for Digital Innovation (CDI), so we spent the weekend in some truly unique spaces in East London, the kind of place where you walk past the Orbit Tower (the red sculpture from the 2012 Olympics) and then code under a rotating floating Earth inside the building (Figure 2). London was already covered in Christmas lights everywhere you walked, so moving between the hackathon and the city felt like stepping between a research lab and a holiday postcard.

**Figure 2.** East London views: UCL East campus and the ArcelorMittal Orbit (also called the Orbit Tower) (left), and the floating Earth installation inside the UCL Centre for Digital Innovation (right). Images by the author.

In total, the hackathon brought together more than 200 participants and roughly 25 different awards across all kinds of categories. Teams weren’t dropped in cold: before the weekend they had access to tutorials, example notebooks, and other resources that helped them prepare [5], choose a track, and hit the ground running once the clock started. As deliverables, each team was expected to submit a public GitHub repository, record a short demo, and create a poster or slide deck to present their solution to the jury, which made it much easier to understand the full workflow and real-world potential of every project.

The jury came from a surprisingly diverse mix of organisations: Holistic AI (the organiser), the UCL Centre for Digital Innovation (CDI), AWS, Valyu, NVIDIA, Entrepreneurs First, and others, including companies interested in the talent and ideas on display. They selected the winners for each of the three main tracks, but also handed out a whole constellation of mystery and special awards that celebrated much more than just the most technically advanced solution.

Among these special awards there was a Brave Soldier-style prize for the team that showed true resilience and kept going even when their teammates started disappearing, literally leaving one soldier standing; a Best Pitch award, because selling your idea is also part of getting the job done (especially since technical professionals tend to struggle a bit with this); and a Highest Resource Usage prize for the teams that really leaned into AWS and squeezed every last spark out of the cloud. These and other award categories are summarised on the hackathon website [2].

One of the most curious things about the weekend was the chance to see NVIDIA’s ultra-compact AI supercomputer up close and even take a photo with the iconic leather-jacket setup to recreate the famous Elon Musk × Jensen Huang “leather jacket moment” [6] shown on the big screen (Figure 3). To make it even better, some of the agents we were trying to break in the Dear Grandma challenge were actually running on similar NVIDIA GPU hardware, so this tiny supercomputer was literally the brain behind the agents that competitors were attacking.

**Figure 3.** The full NVIDIA experience: the leather-jacket photo setup with the DGX Spark (left) and a close-up of the ultra-compact DGX Spark (right). Photos by the author.

The Agentic Arena

As mentioned at the beginning of this article, the heart of the weekend was structured around three tracks (Figure 4). Each one explored a different question about modern AI agents: how to build them so they work, how to make them transparent, and how to make sure they don’t go rogue.

Teams could pick whichever track best fit their use case, but in practice many projects naturally crossed track boundaries, a sign of how eager people were to learn, connect, and bring together different aspects of the agent lifecycle (yes, the idea that the more tracks you join the better your chances of winning was floating around too, but we’ll skip that for now 😉).

**Figure 4.** The three tracks of the Great Agent Hack 2025: *Agent Iron Man* (build agents that don’t break), *Agent Glass Box* (understand agent behaviour), and *Dear Grandma* (attack like a red team, defend like a guardian). Image by the author.

Track A. Agent Iron Man: Agents that work, and last

This was the engineering reality check track. The goal was to build a high-performing, production-ready multi-agent architecture with clear agent roles, tools, and memory wired together in a way that could actually survive outside a hackathon.

Evaluation focused on things that usually only hurt you in production: performance (speed, latency, cost), robustness (how the agent handles tool failures, bad inputs, and edge cases), architecture quality (clear separation between agents, safe tool orchestration, sensible fallbacks), and monitoring (observability, structured outputs, basic health checks). Teams were also expected to account for carbon footprint by favouring smaller or cheaper models where possible and measuring energy and token usage, so the agent stays a conservative, responsible use of compute.
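To make the robustness and monitoring criteria more concrete, here is a minimal sketch of a tool call wrapped with retries, a fallback, and a structured result that a monitoring layer could log. Everything in it (`ToolResult`, `call_with_fallback`, `flaky_tool`, `cached_answer`) is a hypothetical name invented for illustration; it is not taken from any team's code or from the hackathon stack.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ToolResult:
    """Structured record of one tool call, suitable for logging or monitoring."""
    ok: bool
    value: Optional[str]
    attempts: int
    latency_s: float
    used_fallback: bool


def call_with_fallback(tool: Callable[[str], str],
                       fallback: Callable[[str], str],
                       query: str,
                       max_attempts: int = 3) -> ToolResult:
    """Try the primary tool a few times, then degrade to a fallback."""
    start = time.perf_counter()
    for attempt in range(1, max_attempts + 1):
        try:
            value = tool(query)
            return ToolResult(True, value, attempt,
                              time.perf_counter() - start, False)
        except Exception:
            continue  # a real system would log the error and back off here
    # Primary tool exhausted its retries: degrade gracefully instead of crashing.
    try:
        value = fallback(query)
        return ToolResult(True, value, max_attempts,
                          time.perf_counter() - start, True)
    except Exception:
        return ToolResult(False, None, max_attempts,
                          time.perf_counter() - start, True)


def flaky_tool(query: str) -> str:
    """Hypothetical tool that always fails, to exercise the fallback path."""
    raise TimeoutError("simulated upstream timeout")


def cached_answer(query: str) -> str:
    """Hypothetical cheaper fallback, e.g. a cache lookup."""
    return f"cached result for {query!r}"


result = call_with_fallback(flaky_tool, cached_answer, "vehicle history")
print(result)
```

Returning a dataclass instead of a bare string keeps latency, attempt count, and fallback usage visible to whatever health checks sit on top of the agent.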

This track is also a small taste of what’s coming as agents become more widely used and systems grow more complex, with many services talking to one another while still needing to meet tight latency and cost targets.

Among the projects, one that caught my eye was FairQuote [4]: an intelligent car-insurance underwriting system that uses an orchestrator agent plus specialised intake, pricing, and policy agents that coordinate to collect data, assess risk, calculate premiums, and generate explainable policies in a single conversation. Architecturally, it points towards the next wave of multi-agent enterprise workflows, where robustness, clear responsibilities, and strong observability matter just as much as the underlying models.
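The orchestrator-plus-specialists pattern can be sketched in a few lines. This is only an illustration of the general shape, not FairQuote's implementation: the agent functions, the `Case` record, and above all the pricing rule are invented here, and real agents would call an LLM and external tools rather than plain Python functions.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Case:
    """Shared state passed between agents; each step appends to the trace."""
    raw: Dict[str, str]
    data: Dict[str, float] = field(default_factory=dict)
    trace: List[str] = field(default_factory=list)


def intake_agent(case: Case) -> Case:
    """Collect and normalise the customer's answers."""
    case.data["age"] = float(case.raw["age"])
    case.data["claims"] = float(case.raw.get("claims", "0"))
    case.trace.append("intake: parsed age and claims history")
    return case


def pricing_agent(case: Case) -> Case:
    """Toy pricing rule, invented purely for illustration."""
    base = 500.0
    loading = 1.5 if case.data["age"] < 25 else 1.0
    case.data["premium"] = base * loading + 100.0 * case.data["claims"]
    case.trace.append(f"pricing: base={base}, loading={loading}")
    return case


def policy_agent(case: Case) -> Case:
    """Produce the final quote and record it in the trace."""
    case.trace.append(f"policy: quoted premium {case.data['premium']:.2f}")
    return case


def orchestrate(raw: Dict[str, str],
                agents: List[Callable[[Case], Case]]) -> Case:
    """Run each specialised agent in order over the shared case."""
    case = Case(raw=raw)
    for agent in agents:
        case = agent(case)
    return case


case = orchestrate({"age": "22", "claims": "1"},
                   [intake_agent, pricing_agent, policy_agent])
print(case.data["premium"])
print("\n".join(case.trace))
```

Keeping one narrow responsibility per agent and an append-only trace is what makes the final quote explainable: each line of the trace says which agent contributed what.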

Underwriting is a good example because it’s one of the hardest and most business-critical problems in insurance. It sits at the intersection of regulation, actuarial science, and customer experience: every decision about accepting a risk, pricing it, or applying exclusions passes through this process. When underwriting is slow or opaque, customers get frustrated, partners lose trust, and insurers risk mispriced portfolios and regulatory scrutiny. When it works well, it quietly keeps the system stable, allocating capital efficiently, protecting the balance sheet, and supporting fair pricing across segments.

So, in this track, it was great to see not only solid engineering, but also the real problems teams tackled: underwriting, end-to-end claims handling, fraud investigation, and even emergency-services dispatch, where multi-agent systems coordinated triage and decision support in real time. Even if the weekend outputs were still demos, they pointed towards the multi-agent patterns, safeguards, and monitoring that will matter as similar architectures move from hackathon tables into live enterprise environments.

Team tool choices lined up closely with the hackathon’s recommended stack: AWS AgentCore with the Strands Agents SDK for orchestration, Amazon Nova and other Bedrock-hosted models (smaller SLMs to stay frugal), and evaluation frameworks like AgentHarm [7]. The latter lets you test whether an LLM agent can correctly sequence synthetic tools such as dark-web search, web scrapers, email senders, payment or bank-transfer functions, and code or shell tools, so you can measure both its robustness to jailbreaks and how capable it remains at executing multi-step harmful workflows once safety barriers are bypassed.

Track B. Agent Glass Box: Agents you can see, and trust

The transparency track focused on making agentic systems explainable, auditable, and interpretable for humans and organisations. Teams were asked to build agents whose reasoning, memory updates, and actions could be traced and inspected in real time, instead of remaining opaque black boxes. In practice, the projects fell into several families: observability pipelines, explainability tools, governance and safety layers, and professional‑discovery or traceability tools.

For me, one of the projects that best captured the idea of a “glass box” was GenAI Explainer. We all know text-to-image diffusion models can be powerful but risky: traditional diffusion systems have already been shown to reproduce societal biases [8], and even newer models like FLUX.1 can still reflect patterns in their training data [9] while offering almost no insight into why a particular image looks the way it does. At the hackathon, the GenAI Explainer team tackled this by wrapping FLUX.1 with an explainability layer that lets you see how each word or segment of a prompt influences the generated image, audit outputs for brand, legal, or safety compliance, and iteratively refine prompts while watching the impact live, with every generation step tracked. In practice, they turned diffusion from a black box into something much closer to a glass-box, auditable workflow.
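One simple way to estimate how much each prompt word matters is leave-one-out ablation: regenerate with one word removed and measure how far the output moves. The sketch below illustrates that idea only; I do not know which attribution method the GenAI Explainer team used, and `toy_generate` is a made-up stand-in that returns a two-number "image" so the example runs without a diffusion model.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical per-word effects on a 2-number "image"; purely illustrative.
WEIGHTS: Dict[str, List[float]] = {"red": [1.0, 0.0], "car": [0.0, 2.0]}


def toy_generate(prompt: str) -> List[float]:
    """Stand-in for an image generator: sums the weights of known words."""
    vec = [0.0, 0.0]
    for word in prompt.split():
        dx, dy = WEIGHTS.get(word, [0.0, 0.0])
        vec[0] += dx
        vec[1] += dy
    return vec


def l1_distance(a: List[float], b: List[float]) -> float:
    """Sum of absolute differences between two outputs."""
    return sum(abs(x - y) for x, y in zip(a, b))


def leave_one_out(prompt: str,
                  generate: Callable[[str], List[float]],
                  distance: Callable[[List[float], List[float]], float]
                  ) -> List[Tuple[str, float]]:
    """Score each word by how much the output changes when it is removed."""
    words = prompt.split()
    baseline = generate(prompt)
    scores = []
    for i, word in enumerate(words):
        ablated = " ".join(words[:i] + words[i + 1:])
        scores.append((word, distance(baseline, generate(ablated))))
    return scores


scores = leave_one_out("a red car", toy_generate, l1_distance)
print(scores)
```

With a real generator, `generate` would need a fixed random seed so that differences come from the prompt rather than from sampling noise, and `distance` would compare images or embeddings instead of two numbers.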

In the end, Track B was a reminder that algorithmic transparency is no longer optional: legal and risk teams increasingly need to show that automated decisions are explainable and not biased, and the kind of ‘glass‑box’ thinking behind projects like GenAI Explainer is something we should carry into every agentic application we build.

In this track, team tool choices combined tracing platforms such as LangSmith or LangFuse, AWS observability services like CloudWatch, X‑Ray, or Bedrock monitoring, and research tools like AgentGraph [10] (converting traces into interactive knowledge graphs), AgentSeer [11] (building action graphs and doing failure/vulnerability analysis), and the Who_and_When failure‑attribution dataset [12] to analyse and visualise agent traces in depth, to name just a few.

Track C. Dear Grandma: Agents that stay safe, and behave

In this track, teams were given seven secret LLM agents 🐺🦊🦅🐻🐜🐘🦎, each represented by an animal, and the mission was to break them, understand them, and identify them. These seven hidden “stealth agents” symbolised different behaviours, strengths, and attack surfaces that teams needed to uncover. The challenge was to build a red‑teaming framework that could attack any of the seven live animal‑agent endpoints using the API provided by the event organisers, backed by NVIDIA-powered infrastructure.

In the hackathon, each “animal” agent was a live AI system exposed through a single API service, with different routes for each animal. Teams could send prompts to these animal‑specific routes and observe how the agents behaved in real time, each with its own personality and capabilities, which helped red‑teamers design targeted tests and attacks.

**Figure 5.** Example of a jailbreak test against some of the “animal” agents: when faced with a DAN‑style prompt, each model responds with a playful refusal and a consistent safety message, revealing both their shared guardrails and their distinct personalities.

Track C wasn’t limited to the seven “animal” agents behind the API; attacking commercial systems like ChatGPT, Claude, or Gemini was also allowed as long as teams treated it as part of a systematic security assessment.

In short, a solution was expected to analyse, attack, and explain AI agent vulnerabilities, perform behavioural forensics, and work out why an attack succeeds.

The jailbreaking lab team used a two‑step process. First, they built an attack library of proven jailbreak prompts, based on techniques reported in the literature such as Base64 obfuscation, CSS/HTML injection, and other prompt‑level tricks. Second, they applied a genetic algorithm to mutate and improve these prompts: whenever an attack from step one partially succeeded, the algorithm would tweak it (changing wording, adding context, combining two prompts, or further obfuscating instructions) so that successful variants were kept and weak ones were discarded. Over time, this evolutionary search produced stronger and stronger adversarial prompts and even uncovered entirely new ways to break the agents.
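The evolutionary loop described above can be sketched generically. This is a toy under stated assumptions, not the team's code: the seed strings are harmless placeholders, the four mutation operators are simplified versions of the ones listed, and `toy_score` is a made-up stand-in for what would really be a judged response from the target agent during an authorised assessment.

```python
import base64
import random
from typing import Callable, List, Tuple

SWAPS = {"please": "kindly", "tell": "describe", "explain": "outline"}


def reword(prompt: str, rng: random.Random) -> str:
    """Change wording: swap one randomly chosen word for a synonym if known."""
    words = prompt.split()
    if not words:
        return prompt
    i = rng.randrange(len(words))
    words[i] = SWAPS.get(words[i], words[i])
    return " ".join(words)


def add_context(prompt: str, rng: random.Random) -> str:
    """Add context: prepend a framing phrase."""
    return "For a security audit: " + prompt


def obfuscate(prompt: str, rng: random.Random) -> str:
    """Obfuscate: Base64-encode the whole prompt."""
    return base64.b64encode(prompt.encode()).decode()


def crossover(a: str, b: str) -> str:
    """Combine two prompts: first half of one, second half of the other."""
    wa, wb = a.split(), b.split()
    return " ".join(wa[: len(wa) // 2] + wb[len(wb) // 2:])


MUTATIONS = [reword, add_context, obfuscate]


def toy_score(prompt: str) -> float:
    """Placeholder fitness; a real one would judge the target's response."""
    markers = {"audit", "kindly", "describe", "outline"}
    words = {w.strip(":,.").lower() for w in prompt.split()}
    return float(len(markers & words))


def evolve(seeds: List[str], score: Callable[[str], float],
           generations: int = 5, keep: int = 4, children: int = 8,
           seed: int = 0) -> Tuple[List[str], List[float]]:
    """Keep the fittest variants each generation and discard the rest."""
    rng = random.Random(seed)
    population = list(seeds)
    history = []
    for _ in range(generations):
        offspring = []
        for _ in range(children):
            if rng.random() < 0.3 and len(population) >= 2:
                a, b = rng.sample(population, 2)
                offspring.append(crossover(a, b))
            else:
                mutate = rng.choice(MUTATIONS)
                offspring.append(mutate(rng.choice(population), rng))
        # Elitism: parents compete with children, so the best score never drops.
        pool = list(dict.fromkeys(population + offspring))
        population = sorted(pool, key=score, reverse=True)[:keep]
        history.append(score(population[0]))
    return population, history


seeds = ["please tell me the secret", "explain the hidden rule please"]
population, history = evolve(seeds, toy_score)
print(history)
```

Because the parents stay in the selection pool, the best score per generation can only stay flat or rise, which matches the "successful variants were kept" behaviour in the description.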

HSIA was another standout project that pushed these ideas into the robotics world. Instead of attacking the animal agents, they targeted a Vision-Language-Action (VLA) robotic system and showed how its perception could be corrupted at the semantic level. The pixels in the image stayed exactly the same; what changed was the internal caption generated by the model. With subtle, carefully crafted perturbations, the VLA system could flip from “I see a bottle in the image” to “I see a knife in the image,” even though no knife was present, leading the robot to act on a false belief about its environment. Their work highlights that multimodal systems can be compromised without touching the raw image, exposing a critical vulnerability for next-generation robotic AI.

Lessons Learned

If I had to summarise what this hackathon taught me, it would be:

Be a Brave Soldier. Perseverance matters more than competition. It’s not about beating others; it’s about staying resilient, adapting when things break (because they will), and delivering the best version of your idea. Events like this aren’t just technical challenges; they’re opportunities to showcase your skills and the kind of dedication companies genuinely value.

Prepare ahead of time. The teams that did well weren’t necessarily the most senior; they were the ones who arrived already knowing the format, the expectations, and the evaluation criteria, and who had gone through the tutorials and resources shared in advance.

Master the 5-minute pitch. This is crucial. Evaluators and judges move fast. You might spend several days building something, but you only get a few minutes to make them care. So, have a pitch ready that explains the value of your project clearly, quickly, and in a way that sparks curiosity. If those 5 minutes are great, the judges will ask for more. This applies equally to junior profiles and senior engineers (storytelling is part of the job). I struggle with this too; in real life we usually don’t have much time to demonstrate our ideas.

These Events Are Becoming More Meaningful Than Ever. These events are gaining more interest every year, and the organisers even doubled the number of spots this year, which shows how valuable the experience is. That’s why it’s so important to participate only if you really want to be there and can commit your time and energy.

Research the sponsors. Before the event, look up the companies involved and think about which ones might be most interested in your approach. Tailor your pitch accordingly. Sponsors are not just judges; they’re potential collaborators, mentors, and even future teammates.

Strong Fundamentals Beat Shiny Models. One key takeaway from the hackathon is that winning wasn’t about using the newest or most hyped models. The top teams didn’t succeed because they relied on the biggest or flashiest architectures; they excelled because they built robust solutions on top of solid, well-understood techniques such as genetic algorithms and robust diffusion models, among others. The real differentiator was how creatively they combined these foundations with agentic methodologies, clever evaluation setups, and practical engineering to tackle persistent challenges.

Collaborative Innovation Accelerates Progress. The event highlighted how cross-disciplinary collaboration between academia, industry, and AI governance experts can significantly strengthen both AI development and governance frameworks. Even participants who weren’t in technical roles contributed valuable ideas grounded in real problems from their own domains, bringing perspectives that pure engineering alone can’t provide. It’s also a great opportunity to connect with people outside your usual technical bubble, expanding not just your network, but the way you think about the impact and applications of AI.

Finally, a bigger reflection: agents are evolving fast, and with that come new architectural challenges, safety concerns, and responsibilities. These are not hypothetical problems of the future; they’re happening right now. Being responsible with AI applications is not a hype-driven slogan; it’s part of the daily job of any AI or data science professional.

Conclusions

These events are quietly shaping how we think about AI governance. When you put powerful agentic systems under time pressure and in messy, realistic scenarios, you’re forced to confront unpredictable behaviour head-on. That’s where the real learning happens: how do we balance rapid innovation with trust and safety? How do we design evaluation frameworks and guardrails that let us move fast without losing control? This hackathon didn’t just reward clever models; it rewarded thoughtful governance.

And while there are plenty of AI events popping up everywhere, this is one of the few you should really keep an eye on, the kind that genuinely helps you grow, exposes you to real-world challenges, and reminds you why it’s worth staying curious and keeping your skills sharp.

References

References in order of appearance:

[1] “NVIDIA CEO Jensen Huang kicks off CES 2025. The Future is Here!” SupplyChainToday, 2025.

[2] Great Agent Hack 2025: Holistic AI x UCL. Available at: https://hackathon.holisticai.com/ (accessed November 22, 2025).

[3] Valyu AI. (2025). The Great Agent Hack 2025: Agent Performance, Reliability and Valyu-Powered Retrieval. Retrieved from https://www.valyu.ai/blogs/the-great-agent-hack-2025-agent-performance-reliability-and-valyu-powered-retrieval

[4] Great Agent Hack 2025. “Project gallery: Great Agent Hack 2025. Build and test transparent, robust, and safe AI agents for real-world impact.” Devpost. Available at: https://hai-great-agent-hack-2025.devpost.com/project-gallery?page=1.

[5] Holistic AI. (2025). Hackathon 2025 [Source code]. GitHub. https://github.com/holistic-ai/hackathon-2025 (last accessed November 30, 2025).

[6] Elon Musk Surprised by Jensen Huang’s DGX Spark Gift. (n.d.). YouTube Shorts. https://www.youtube.com/shorts/l7x_Tfrbubs

[7] Andriushchenko, M., Souly, A., Dziemian, M., Duenas, D., Lin, M., Wang, J., Hendrycks, D., Zou, A., Kolter, Z., Fredrikson, M., Winsor, E., Wynne, J., Gal, Y., & Davies, X. (2024). AgentHarm: A benchmark for measuring harmfulness of LLM agents. arXiv. https://arxiv.org/abs/2410.09024

[8] Tiku, N., Schaul, K., & Chen, S. (2023, November 1). This is how AI image generators see the world. Washington Post. https://www.washingtonpost.com/technology/interactive/2023/ai-generated-images-bias-racism-sexism-stereotypes/ (last accessed August 20, 2025).

[9] Porikli, S., & Porikli, V. (2025). Hidden Bias in the Machine: Stereotypes in Text-to-Image Models. Available at: https://openreview.net/pdf?id=u4KsKVp53s

[10] Wu, Z., Cho, S., Munoz, C., King, T., Mohammed, U., Kazimi, E., Pérez-Ortiz, M., Bulathwela, S., & Koshiyama, A. (2025). AgentGraph: Trace-to-Graph platform for interactive analysis and robustness testing in agentic AI systems. Holistic AI & University College London.

[11] Wicaksono, I., Wu, Z., Patel, R., King, T., Koshiyama, A., & Treleaven, P. (2025). Mind the Gap: Evaluating Model- and Agentic-Level Vulnerabilities in LLMs with Action Graphs.

[12] Zhang, S., Yin, M., Zhang, J., Liu, J., Han, Z., Zhang, J., Li, B., Wang, C., Wang, H., Chen, Y., & Wu, Q. (2025). Which agent causes task failures and when? On automated failure attribution of LLM multi-agent systems (arXiv Preprint No. 2505.00212).
