Towards a science of scaling agent systems: When and why agent systems work?

24-02-2026

Speaker

Yubin Kim is a PhD student in the MIT Media Lab’s Personal Robots Group. His work focuses on multi-agent systems for healthcare and on wearable intelligence. Yubin is the first author of MDAgents, a tiered multi-LLM collaboration system designed for medical decision-making, which was accepted as an oral paper at NeurIPS 2024. His broader research explores medical hallucinations, agentic scaling laws, proactive clinical reasoning, and LLMs for wearable predictions, including the ongoing projects Health-LLM, Proactive Agent, and Medical Hallucination. He has been working with Google Research and Google DeepMind, and his work spans machine learning, health AI, robotics, and agentic systems. Ultimately, his research aims to create trustworthy personal health agents capable of longitudinal reasoning and real-world clinical integration. Homepage: https://ybkim95.github.io/

Abstract

Agents, language model-based systems that are capable of reasoning, planning, and acting, are becoming the dominant paradigm for real-world AI applications. Despite this widespread adoption, the principles that determine their performance remain underexplored. We address this by deriving quantitative scaling principles for agent systems. We first formalize a definition for agentic evaluation and characterize scaling laws as the interplay between agent quantity, coordination structure, model capability, and task properties. We evaluate this framework across four benchmarks: Finance-Agent, BrowseComp-Plus, PlanCraft, and Workbench. With five canonical agent architectures (Single-Agent and four Multi-Agent Systems: Independent, Centralized, Decentralized, Hybrid), instantiated across three LLM families, we perform a controlled evaluation spanning 180 configurations. We derive a predictive model using coordination metrics that achieves cross-validated R²=0.524, enabling prediction on unseen task domains. We identify three effects: (1) a tool-coordination trade-off: under fixed computational budgets, tool-heavy tasks suffer disproportionately from multi-agent overhead; (2) capability saturation: coordination yields diminishing or negative returns once single-agent baselines exceed ~45%; and (3) topology-dependent error amplification: independent agents amplify errors 17.2x, while centralized coordination contains this to 4.4x. Centralized coordination improves performance by 80.8% on parallelizable tasks, while decentralized coordination excels on web navigation (+9.2% vs. +0.2%). Yet for sequential reasoning tasks, every multi-agent variant degraded performance by 39-70%. The framework predicts the optimal coordination strategy for 87% of held-out configurations. Out-of-sample validation on GPT-5.2 achieves MAE=0.071 and confirms that four of five scaling principles generalize to unseen frontier models.
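
As a rough illustration of how the qualitative findings in the abstract could be read as a decision heuristic, the following Python sketch maps a task description to a suggested architecture. It is not the predictive model from the talk: the `Task` class, the task-kind labels, the function name, the order in which the rules are checked, and the single-agent fallback are all assumptions made for this example. Only the ~45% saturation threshold and the direction of each effect come from the abstract.

```python
from dataclasses import dataclass

# Approximate single-agent accuracy above which the abstract reports
# diminishing or negative returns from coordination (~45%).
SATURATION_THRESHOLD = 0.45


@dataclass
class Task:
    # Hypothetical labels chosen for this sketch:
    # "sequential", "parallelizable", "web_navigation", or anything else.
    kind: str
    # Accuracy of a single-agent baseline on this task, in [0, 1].
    single_agent_accuracy: float


def suggest_architecture(task: Task) -> str:
    """Return a suggested architecture name for a task.

    The rule order is an assumption of this sketch, not something
    the abstract specifies.
    """
    # Sequential reasoning: every multi-agent variant degraded performance.
    if task.kind == "sequential":
        return "single-agent"
    # Capability saturation: a strong baseline leaves little room to gain.
    if task.single_agent_accuracy > SATURATION_THRESHOLD:
        return "single-agent"
    # Parallelizable tasks benefited most from centralized coordination.
    if task.kind == "parallelizable":
        return "centralized"
    # Web navigation favored decentralized coordination.
    if task.kind == "web_navigation":
        return "decentralized"
    # Conservative fallback (an assumption): avoid coordination overhead.
    return "single-agent"


if __name__ == "__main__":
    print(suggest_architecture(Task("parallelizable", 0.30)))
    print(suggest_architecture(Task("sequential", 0.30)))
```

A real selection procedure would rely on the measured coordination metrics described in the abstract rather than on hand-written rules like these.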
